In computer architecture, instructions per cycle (IPC) is one aspect of a processor's performance: the average number of instructions executed for each clock cycle. It is the multiplicative inverse of cycles per instruction.[1]

Explanation

Calculation of IPC

IPC is calculated by running a fixed piece of code, counting the number of machine-level instructions required to complete it, and then using high-performance timers to measure the number of clock cycles it takes to complete on the actual hardware. The final result is obtained by dividing the number of instructions by the number of CPU clock cycles.
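A minimal sketch of the arithmetic in Python, assuming the instruction and cycle counts have already been measured (on Linux, for example, `perf stat` reports `instructions` and `cycles` counters). The function names and the counts below are hypothetical; the second function shows that cycles per instruction is simply the inverse of IPC.

```python
def ipc(instructions_retired: int, cycles: int) -> float:
    """Instructions per cycle: instructions divided by clock cycles."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions_retired / cycles


def cpi(instructions_retired: int, cycles: int) -> float:
    """Cycles per instruction: the multiplicative inverse of IPC."""
    if instructions_retired <= 0:
        raise ValueError("instructions_retired must be positive")
    return cycles / instructions_retired


# Hypothetical measurement; these counts are made up for illustration.
instructions = 3_000_000
cycles = 2_000_000

print(ipc(instructions, cycles))  # 1.5
print(cpi(instructions, cycles))
```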

The number of instructions per second for a processor can be derived by multiplying its instructions per cycle by its clock rate (cycles per second, given in hertz); floating-point operations per second are derived the same way from the number of floating-point operations per cycle. The number of instructions per second is an approximate indicator of the likely performance of the processor.
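The conversion is a single multiplication. The sketch below assumes a hypothetical processor running at 3.0 GHz that sustains an average IPC of 2.0 and 8 floating-point operations per cycle; all three figures are invented for illustration, and the function name is illustrative.

```python
def per_second(per_cycle: float, clock_hz: float) -> float:
    """Convert a per-cycle rate into a per-second rate by multiplying by the clock rate."""
    return per_cycle * clock_hz


CLOCK_HZ = 3.0e9  # hypothetical 3.0 GHz clock

ips = per_second(2.0, CLOCK_HZ)    # hypothetical average IPC of 2.0
flops = per_second(8.0, CLOCK_HZ)  # hypothetical 8 floating-point operations per cycle

print(f"{ips / 1e9:.1f} billion instructions per second")  # 6.0 billion instructions per second
print(f"{flops / 1e9:.1f} GFLOPS")                         # 24.0 GFLOPS
```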

The number of instructions executed per clock is not a constant for a given processor; it depends on how the particular software being run interacts with the processor, and indeed with the entire machine, particularly the memory hierarchy. However, certain processor features tend to lead to designs with higher-than-average IPC values: the presence of multiple arithmetic logic units (an ALU is a processor subsystem that can perform elementary arithmetic and logical operations), and short pipelines. When comparing different instruction sets, an implementation of a simpler instruction set may achieve a higher IPC than an implementation of a more complex instruction set built with the same chip technology; however, the more complex instruction set may be able to achieve more useful work with fewer instructions.
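The trade-off between IPC and instruction count can be made concrete with the relation execution time = instruction count / (IPC × clock rate). The sketch below compares two hypothetical implementations at the same 3 GHz clock; the instruction counts and IPC values are invented for illustration and do not describe any real instruction set.

```python
def execution_time_s(instruction_count: float, ipc: float, clock_hz: float) -> float:
    """Seconds to run a program: instructions / (instructions per cycle x cycles per second)."""
    return instruction_count / (ipc * clock_hz)


CLOCK_HZ = 3.0e9  # same clock rate for both hypothetical implementations

# Hypothetical simpler instruction set: more instructions, but a higher IPC.
simple = execution_time_s(instruction_count=1.2e9, ipc=2.0, clock_hz=CLOCK_HZ)
# Hypothetical more complex instruction set: fewer instructions, lower IPC.
complex_isa = execution_time_s(instruction_count=0.8e9, ipc=1.5, clock_hz=CLOCK_HZ)

print(f"simpler ISA: {simple:.3f} s")       # simpler ISA: 0.200 s
print(f"complex ISA: {complex_isa:.3f} s")  # complex ISA: 0.178 s
```

With these made-up numbers the lower-IPC design finishes first because it executes fewer instructions; different numbers could reverse the outcome, which is why IPC alone does not settle the comparison.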

Factors governing IPC

A given level of instructions per second can be achieved with a high IPC and a low clock speed (like the AMD Athlon and the early Intel Core series), or with a low IPC and a high clock speed (like the Intel Pentium 4 and, to a lesser extent, AMD Bulldozer). Both are valid processor designs, and the choice between the two is often dictated by history, engineering constraints, or marketing pressures. However, all else being equal, a design that combines a high IPC with a high clock frequency delivers the highest performance.
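The sketch below compares three hypothetical designs; the labels and numbers are invented for illustration and are not measurements of the processors named above. Two of them reach the same instruction throughput by different routes, and the third combines a high IPC with a high clock.

```python
# Hypothetical designs: (average IPC, clock rate in Hz). Numbers are illustrative only.
designs = {
    "high IPC, low clock": (2.0, 2.0e9),
    "low IPC, high clock": (1.0, 4.0e9),
    "high IPC, high clock": (2.0, 4.0e9),
}

throughput = {}
for name, (ipc, clock_hz) in designs.items():
    # Instructions per second = instructions per cycle x cycles per second.
    throughput[name] = ipc * clock_hz
    print(f"{name}: {throughput[name] / 1e9:.1f} billion instructions per second")
```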

Instructions per cycle for various processors

These numbers are not the IPC values of these CPUs; they represent the theoretical peak number of floating-point operations each core can perform per clock cycle. They follow from the width of the processor's SIMD units and the number of floating-point pipelines (counting a fused multiply-add as two operations), and they do not correspond to the primary architectural definition of IPC, which measures the average number of instructions (integer, floating-point and control) retired per cycle.

To get a theoretical GFLOPS (billions of floating-point operations per second) rating for a given CPU, multiply the number in the table below by the number of cores and then by the base clock (in GHz) of the particular CPU model. For example, a Coffee Lake i7-8700K can theoretically perform 32 single-precision floating-point operations per cycle per core, has 6 cores and a 3.7 GHz base clock. This gives it 32 × 6 × 3.7 = 710.4 GFLOPS.
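The same calculation as a small Python function, using the i7-8700K figures from the example above and, for double precision, the Skylake-family (Coffee Lake) figure of 16 from the table below. The function name is illustrative, and the result is a theoretical peak, not a measured rate.

```python
def peak_gflops(flops_per_cycle_per_core: float, cores: int, clock_ghz: float) -> float:
    """Theoretical peak GFLOPS: FLOPs per cycle per core x number of cores x clock in GHz."""
    return flops_per_cycle_per_core * cores * clock_ghz


# Core i7-8700K (Coffee Lake): 6 cores, 3.7 GHz base clock.
fp32 = peak_gflops(32, 6, 3.7)  # single precision: 32 FLOPs per cycle per core
fp64 = peak_gflops(16, 6, 3.7)  # double precision: 16 FLOPs per cycle per core

print(f"FP32 peak: {fp32:.1f} GFLOPS")  # FP32 peak: 710.4 GFLOPS
print(f"FP64 peak: {fp64:.1f} GFLOPS")  # FP64 peak: 355.2 GFLOPS
```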

Hardware multithreading does not raise this theoretical peak. In simultaneous multithreading (such as Intel's Hyper-Threading), two threads run on the same core at the same time and share its pipeline and execution resources; when one thread stalls, for example while waiting for data to arrive from memory after a cache miss, the other thread can use execution units that would otherwise sit idle. Because both threads compete for the same floating-point units, this feature does not increase the theoretical floating-point performance of a CPU, but in certain cases it can help the CPU come closer to that performance in practice, across multiple threads.

Microarchitecture | Double precision (FP64) FLOPs per cycle | Single precision (FP32) FLOPs per cycle
Intel Atom (Bonnell, Saltwell, Silvermont and Goldmont) | 2 | 4
Intel Core (Merom, Penryn) | 4 | 8
Intel Nehalem (Nehalem, Westmere) | 4 | 8
Intel Sandy Bridge (Sandy Bridge, Ivy Bridge) (AVX) | 8 | 16
Intel Haswell (Haswell, Devil's Canyon, Broadwell) (AVX2) | 16 | 32
Intel Skylake (Skylake, Kaby Lake, Coffee Lake) (AVX2) | 16 | 32
Intel Xeon Phi (Knights Corner) (SSE) | 16 | 32
Intel Skylake-X (AVX-512) | 32 | 64
Intel Xeon Phi (Knights Landing, Knights Mill) (AVX-512) | 32 | 64
AMD Bobcat | 2 | 4
AMD Jaguar | 4 | 8
AMD Puma | 4 | 8
AMD K10 | 6 | 12
AMD Bulldozer (Piledriver, Steamroller, Excavator) | 6 | 12
AMD Ryzen | 8[2] | 16[2]
AMD Ryzen 2 | 8[2] | 16[2]
AMD Ryzen 3 (AVX2) | 16[3] | 32[3]
ARM Cortex-A7, A9, A15 | 1 | 8
ARM Cortex-A32, A35, A53, A57, A72 | 2 | 8
Qualcomm Krait | 1 | 8
Qualcomm Kryo | 2 | 8
IBM PowerPC A2 (Blue Gene/Q) | 8 | 8 (SP elements are extended to DP and processed on the same units)
Nvidia Fermi (only GeForce GTX 465-480, 560 Ti, 570-590) | 1/4 (locked by driver, 1 in hardware) | 2
Nvidia Fermi (only Quadro 600-2000) | 1/8 | 2
Nvidia Fermi (only Quadro 4000-7000, Tesla) | 1 | 2
Nvidia Pascal (only Quadro GP100 and Tesla P100) | 1 | 2
Nvidia Volta | 1 | 2
Nvidia Kepler (GeForce (except GeForce Titan and Titan Black), Quadro (except Quadro K6000), Tesla K10) | 1/12 (for GK110: locked by driver, 2/3 in hardware) | 2
Nvidia Kepler (GeForce GTX Titan and Titan Black, Quadro K6000, Tesla (except Tesla K10)) | 2/3 | 2
Nvidia Maxwell | 1/16 | 2
Nvidia Pascal (all except Quadro GP100 and Tesla P100) | 1/16 | 2
Nvidia Turing | 1/16 | 2
AMD GCN (all except Radeon VII, Instinct MI50 and MI60) | 1/8 | 2
AMD GCN Vega 20 (only Radeon VII) | 1/2 (locked by driver, 1 in hardware) | 2
AMD GCN Vega 20 (only Radeon Instinct MI50 and MI60) | 1 | 2
AMD RDNA | 1/8 (?) | 2

Computer speed

The useful work that can be done with any computer depends on many factors besides the processor speed. These factors include the instruction set architecture, the processor's microarchitecture, the computer system organization (such as the design of the disk storage system and the capabilities and performance of other attached devices), the efficiency of the operating system and, most importantly, the high-level design of the application software in use.

For users and purchasers of a computer system, instructions per clock is not a particularly useful indication of the performance of their system. For an accurate measure of performance relevant to them, application benchmarks are much more useful. Awareness of IPC is nonetheless useful, because it provides an easy-to-grasp example of why clock speed is not the only factor relevant to computer performance.

References

  1. John L. Hennessy; David A. Patterson (2007). Computer Architecture: A Quantitative Approach.
  2. "Agner's CPU blog". www.agner.org.
  3. "AMD CEO Lisa Su's COMPUTEX 2019 Keynote". www.youtube.com.