- These lists were created with the Instruction Latency Dump feature of AIDA64. If you do not believe in software measurements, wait for the official Intel/AMD/etc. guide and hope it will be more detailed and accurate than the current one. ;) You can create such a dump in AIDA64 by right-clicking the status bar at the bottom of the AIDA64 main window -> CPU Debug -> Instruction Latency Dump. It works fully in the trial version, too.
- In this dump, latency means the time it takes before the next dependent instruction of the same type can start. Throughput means the time it takes before the next independent instruction of the same type can start:
    L: ADD rax, rax        T: ADD rax, rax
       ADD rax, rax           ADD rbx, rbx
       ADD rax, rax           ADD rcx, rcx
       ...                    ...
- These values are measured over long chains of instructions (~6000), so they are sustained rates; peak values can be higher.
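The two definitions above can be illustrated with a small toy model. This is only a sketch under stated assumptions: a single execution resource, a fixed latency and a fixed reciprocal throughput (both in cycles), and instructions written as hypothetical (destination, source) register pairs. It is not how AIDA64 measures, and the numbers used (latency 1, throughput 0.25) are made-up inputs, not values from a dump.

```python
# Toy model of the L: and T: chains (assumption for illustration only;
# this is NOT AIDA64's measurement code).

def start_times(chain, latency, recip_tp):
    """Return the start cycle of every instruction in the chain.

    chain    -- list of (dst, src) register names, e.g. ("rax", "rax")
    latency  -- cycles until the result of an instruction is available
    recip_tp -- cycles until the next independent instruction may start
    """
    ready = {}        # register -> cycle when its newest value is available
    next_slot = 0.0   # earliest cycle at which another instruction may start
    starts = []
    for dst, src in chain:
        # An instruction waits for its operands and for a free issue slot.
        t = max(next_slot, ready.get(dst, 0.0), ready.get(src, 0.0))
        starts.append(t)
        ready[dst] = t + latency
        next_slot = t + recip_tp
    return starts

def cycles_per_instruction(chain, latency, recip_tp):
    """Sustained rate: average distance between consecutive starts."""
    s = start_times(chain, latency, recip_tp)
    return (s[-1] - s[0]) / (len(chain) - 1)

N = 6000  # chain length mentioned in the notes
regs = ["rax", "rbx", "rcx", "rdx", "rsi", "rdi", "r8", "r9"]

# L: chain -- every ADD depends on the previous one.
dependent = [("rax", "rax")] * N
# T: chain -- consecutive ADDs use different registers.
independent = [(regs[i % len(regs)],) * 2 for i in range(N)]

print(cycles_per_instruction(dependent, latency=1, recip_tp=0.25))    # 1.0
print(cycles_per_instruction(independent, latency=1, recip_tp=0.25))  # 0.25
```

In this model the dependent chain recovers the latency and the independent chain recovers the throughput, which is the reason the dump uses two differently shaped chains for the same instruction.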
- Some instructions do not modify the target register, e.g. CMP, TEST, BT, NOP. For these, it is not always possible to measure the instruction latency directly.
- Some instructions never depend on a previous one: they use different source and destination register sets or have a memory operand, so it is not always possible to measure the instruction latency directly, but it is possible to measure instruction pairs, e.g. PUSH + POP, MOV reg, [mem] + MOV [mem], reg. The abbreviation "LS pair" means that the load form and the store form of a move instruction are measured as a pair.
- Newer processors can recognize that some instructions with identical operands are independent of previous ones. In this case the latency can be lower than 1. The classic example is the XOR instruction: the result of XOR eax, eax is always 0, so it never depends on the result of the previous XOR. In the dump, "XOR r32_1, r32_2" means
    L: XOR rax, rbx        T: XOR rax, rbx
       XOR rax, rbx           XOR rbx, rcx
       XOR rax, rbx           XOR rcx, rdx
       ...                    ...
- If the TP (throughput) value is less than 1, more than one instruction of the same type can start in the same clock cycle.
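As a quick worked example of reading a TP value, the sketch below converts it into an approximate number of instructions started per clock cycle by taking the reciprocal. The function name `instructions_per_cycle` and the sample TP values are hypothetical and not taken from any particular dump.

```python
# Reading a TP value: the reciprocal is the sustained number of
# same-type instructions started per cycle (hypothetical helper).

def instructions_per_cycle(tp_cycles):
    """Convert a TP value (cycles per instruction) to instructions per cycle."""
    if tp_cycles <= 0:
        raise ValueError("TP must be positive")
    return 1.0 / tp_cycles

for tp in (1.0, 0.5, 0.33, 0.25):
    ipc = instructions_per_cycle(tp)
    print(f"TP {tp:.2f} -> about {ipc:.2f} instructions per cycle")
```

Because the dump values are rounded measurements, a TP of 0.33 is best read as "about three per cycle" rather than as an exact ratio.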
- With a memory operand, the throughput value can be higher than the latency value, because the throughput measurement uses more memory locations than the latency measurement.
- FSQRT throughput can be higher than FSQRT latency on older processors because FSQRT is measured via
    L: FSQRT               T: FSQRT
       FSQRT                  FDECSTP
       FSQRT                  FSQRT
       FSQRT                  FDECSTP
       ...                    ...
chains, and older processors cannot execute FSQRT and FDECSTP in parallel. Update: FDECSTP changed to FXCH.
- The (I)DIV latency on modern processors depends on the operand size. (I)DIV always uses the rDX:rAX registers for the dividend, quotient and remainder, and only for some operand sizes is it possible that the dividend equals the quotient/remainder pair that overwrites it (e.g. if AX = 0xFEFF, AX remains 0xFEFF after a DIV AL), so rDX/rAX need to be refreshed between divisions. So "DIV r8 12/ 8b ax upd" means
    L: DIV al              T: DIV bl
       MOV ax, const          MOV ax, const
       DIV al                 DIV cl
       MOV ax, const          MOV ax, const
       ...                    ...
chains. Similarly, "DIV r32 2^62/2^31 eax/edx" means
    L: DIV eax             T: DIV ebx
       MOV eax, const1        MOV eax, const1
       MOV edx, const2        MOV edx, const2
       DIV eax                DIV ecx
       MOV eax, const1        MOV eax, const1
       MOV edx, const2        MOV edx, const2
       ...                    ...
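The AX = 0xFEFF example from the (I)DIV note can be checked with a few lines of arithmetic. The sketch below emulates the unsigned 8-bit DIV (AX divided by an 8-bit operand, quotient to AL, remainder to AH); the helper name `div_r8` and the second input value are hypothetical, chosen only to show a case where AX does change.

```python
# Emulation of unsigned "DIV r8" semantics (hypothetical helper for
# illustration): AX / r8 -> AL = quotient, AH = remainder.

def div_r8(ax, divisor):
    """Return the new AX value after an 8-bit unsigned DIV."""
    if not 0 <= ax <= 0xFFFF or not 0 <= divisor <= 0xFF:
        raise ValueError("operands out of range")
    if divisor == 0:
        raise ZeroDivisionError("divide error: divisor is zero")
    quotient, remainder = divmod(ax, divisor)
    if quotient > 0xFF:
        raise OverflowError("divide error: quotient does not fit in AL")
    return (remainder << 8) | quotient   # AH:AL

# DIV AL with AX = 0xFEFF: the divisor is AL = 0xFF.
ax = 0xFEFF
print(hex(div_r8(ax, ax & 0xFF)))   # 0xfeff, AX is unchanged

# A different dividend (made-up value): AX is overwritten, so the next
# DIV in a chain would see a different input unless AX is reloaded.
print(hex(div_r8(0x0123, 0x10)))    # 0x312
```

The second case is why the chains above reload AX (or EAX/EDX) with a constant before every division: without the reload, each DIV would operate on the result of the previous one instead of on the intended operands.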
- For some x87 instruction combinations (and for some SSE instructions in 32-bit mode) the 8 registers are not enough to measure the instruction throughput.
- It is a measurement, not a constant table, so some values are rounded.
- Keep in mind that even though instruction latency and throughput are important, they may not directly reflect CPU performance!