Instruction fusion to vector code

7/6/2023

Micro-operations (uops) are the simple internal operations that the decoders break x86 instructions into; they do not necessarily execute in 1 clock cycle.

The limitations here are: RIP-relative + immediate can never micro-fuse, so a cmp dword of a RIP-relative memory operand against the immediate 1, followed by jnz, can macro-fuse but not micro-fuse.

A cmp/jcc on SnB-family with an indexed addressing mode (a cmp of an indexed memory operand against edx, followed by jnz) will macro- and micro-fuse in the decoders, but the micro-fusion will un-laminate before the issue stage. (So it's 2 total uops in both the fused domain and the unfused domain: a load with an indexed addressing mode, and an ALU cmp/jnz.) You can verify this with perf counters by putting a mov ecx, 1 in between the CMP and JCC vs. after, and noting that uops_issued.any:u and uops_executed.thread both go up by 1 per loop iteration because we defeated macro-fusion.

On Skylake, a cmp dword of a memory operand against the immediate 0, followed by jnz, can't macro-fuse. I tested with a loop that contained some dummy mov ecx,1 instructions. Reordering so one of those mov instructions split up the cmp/jcc didn't change perf counters for fused-domain or unfused-domain uops. (That cmp does micro-fuse, though: uops_issued.any:u is 1 lower than uops_executed.thread, and the loop contains no nop or other "eliminated" instructions, or any other memory instructions that could micro-fuse.)

But a cmp of a non-indexed memory operand against eax, followed by jnz, does macro- and micro-fuse. Reordering so a mov ecx,1 instruction separates CMP from JNZ does change perf counters (proving macro-fusion), and uops_executed is higher than uops_issued by 1 per iteration (proving micro-fusion).

The same cmp against eax with an indexed addressing mode, followed by jne, only macro-fuses, not micro. (Well, actually it micro-fuses in decode but un-laminates before issue because of the indexed addressing mode, and it's not an RMW-register destination like a sub into eax from an indexed memory source, which can keep indexed addressing modes micro-fused. That sub with an indexed addressing mode does macro- and micro-fuse on SKL, and presumably Haswell.)

Some compilers (including GCC, IIRC) prefer to use a separate load instruction and then compare+branch on a register. TODO: check whether gcc and clang's choices are optimal with immediate vs. ...
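The perf-counter reasoning above can be written down as a small decision rule. This is only a sketch: classify_fusion is a hypothetical helper, the numbers passed to it are illustrative per-iteration counts derived from the rules in the text (not measurements), and the micro-fusion test assumes the loop has no eliminated instructions and no other memory instructions that could micro-fuse.

```python
def classify_fusion(issued_adjacent, executed_adjacent,
                    issued_split, executed_split):
    """Interpret per-iteration uop counts for a loop containing cmp + jcc.

    issued_*   : uops_issued.any      (fused domain)
    executed_* : uops_executed.thread (unfused domain)
    *_adjacent : cmp and jcc back to back, dummy mov placed elsewhere
    *_split    : the same dummy mov moved between cmp and jcc
    """
    # Splitting the pair defeats macro-fusion, so both counters go up by 1.
    macro = (issued_split > issued_adjacent
             and executed_split > executed_adjacent)
    # A micro-fused load+ALU pair takes 1 fused-domain slot but executes
    # as 2 uops. Assumes nothing else in the loop skews the two counters.
    micro = executed_adjacent > issued_adjacent
    return {"macro_fused": macro, "micro_fused": micro}


# Illustrative counts for a loop of: mov ecx,1 ; cmp ; jcc
print(classify_fusion(2, 3, 3, 4))  # non-indexed memory operand vs. register
print(classify_fusion(3, 4, 3, 4))  # memory operand vs. immediate on Skylake
print(classify_fusion(3, 3, 4, 4))  # indexed addressing mode (un-laminated)
```

The three calls correspond to the three cases described in the text: both fusions, micro-fusion only, and macro-fusion only.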
Both can happen at the same time, for example in a cmp of a memory operand against eax followed by a jcc. The cmp/jcc can macro-fuse into a single cmp-and-branch ALU uop, and the load from memory can micro-fuse with that uop. Failure to micro-fuse the cmp does not prevent macro-fusion. See Micro fusion and addressing modes.

Micro-fusion stores 2 uops from the same instruction together so they only take up 1 "slot" in the fused-domain parts of the pipeline. But they still have to dispatch separately to separate execution units. And in Intel Sandybridge-family, the RS (Reservation Station, aka scheduler) is in the unfused domain, so they're even stored separately in the scheduler. (See Footnote 2 in my answer on Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths.)

P6 family had a fused-domain RS, as well as ROB, so micro-fusion helped increase the effective size of the out-of-order window there. But SnB-family reportedly simplified the uop format, making it more compact, allowing larger RS sizes that are helpful all the time, not just for micro-fused instructions.

And Sandybridge family will "un-laminate" indexed addressing modes under some conditions, splitting them back into 2 separate uops in their own slots before issue/rename into the ROB in the out-of-order back end, so you lose the front-end issue/rename throughput benefit of micro-fusion.
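To make the slot accounting concrete, here is a minimal sketch of how a load+ALU instruction is counted in each domain. The Instr class and both helper functions are hypothetical names invented for illustration; the model covers only the case described above, where exactly 2 uops share one fused-domain slot.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    unfused_uops: int          # uops that dispatch to execution units
    micro_fused: bool          # decoders packed 2 uops into 1 slot
    unlaminated: bool = False  # split back into 2 slots before issue


def fused_domain_slots(ins: Instr) -> int:
    """Slots used at issue/rename and in the ROB."""
    if ins.micro_fused and not ins.unlaminated:
        return ins.unfused_uops - 1  # load and ALU uop share one slot
    return ins.unfused_uops


def unfused_domain_uops(ins: Instr) -> int:
    """Uops that still dispatch separately, fused or not."""
    return ins.unfused_uops


simple = Instr("load+ALU, simple addressing", 2, micro_fused=True)
indexed = Instr("load+ALU, indexed addressing", 2, micro_fused=True,
                unlaminated=True)

for ins in (simple, indexed):
    print(ins.name, fused_domain_slots(ins), unfused_domain_uops(ins))
```

In this model the un-laminated instruction costs 2 slots in both domains, which is the lost front-end throughput benefit the text describes.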

Fusion is totally separate from how one complex instruction (like cpuid, or a lock add to memory) can decode to multiple uops. The way the retirement stage figures out that all the uops for a single instruction have retired, and thus the instruction has retired, has nothing to do with fusion.

Macro-fusion decodes cmp/jcc or test/jcc into a single compare-and-branch uop. The rest of the pipeline sees it purely as a single uop (except performance counters still count it as 2 instructions). This saves uop cache space, and bandwidth everywhere including decode. In some code, compare-and-branch makes up a significant fraction of the total instruction mix, like maybe 25%, so choosing to look for this fusion rather than other possible fusions like mov dst,src1 / or dst,src2 makes sense.

Sandybridge-family can also macro-fuse some other ALU instructions with conditional branches, like add/sub or inc/dec + JCC with some conditions. (See x86_64 - Assembly - loop conditions and out of order.)
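As a rough illustration of the cmp/jcc and test/jcc pairing, the toy decoder below fuses an adjacent pair into one entry and leaves everything else alone. It is a sketch under loud assumptions: decode and is_jcc are hypothetical helpers, every other instruction is treated as exactly one uop, any j-mnemonic other than jmp is treated as a conditional branch, and none of the extra hardware conditions on fusion are modeled.

```python
FUSIBLE_FIRST = {"cmp", "test"}  # only the pairs named in the text


def is_jcc(mnemonic: str) -> bool:
    """Simplification: any j<cc> mnemonic except the unconditional jmp."""
    return mnemonic.startswith("j") and mnemonic != "jmp"


def decode(mnemonics):
    """Return the uop stream, fusing adjacent cmp/test + jcc pairs."""
    uops = []
    i = 0
    while i < len(mnemonics):
        cur = mnemonics[i]
        nxt = mnemonics[i + 1] if i + 1 < len(mnemonics) else None
        if cur in FUSIBLE_FIRST and nxt is not None and is_jcc(nxt):
            uops.append(f"{cur}+{nxt}")  # one compare-and-branch uop
            i += 2
        else:
            uops.append(cur)
            i += 1
    return uops


adjacent = ["mov", "cmp", "jnz"]
split = ["cmp", "mov", "jnz"]  # a mov between the pair defeats fusion
print(len(adjacent), "instructions ->", decode(adjacent))
print(len(split), "instructions ->", decode(split))
```

Both loops contain 3 instructions, but only the first decodes to 2 uops, which is why the instruction count and the uop count can disagree.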
