[AMDGPU] Align loop headers to prevent instruction fetch split on GFX950 (#181999)
On GFX9, the instruction sequencer fetches 32 bytes at a time. When an
8-byte instruction at a loop header straddles a 32-byte fetch window
boundary, the sequencer must perform two fetches after a backward
branch, incurring a delay. On GFX950, this split fetch causes additional
performance issues.
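The straddle condition comes down to offset arithmetic. The Python
sketch below is illustrative only and not part of the patch: the helper
name fetch_windows_touched and the sample offsets are hypothetical, and
it assumes fetch windows start at 32-byte-aligned addresses.

```python
FETCH_WINDOW = 32  # bytes fetched at a time, per the description above


def fetch_windows_touched(offset: int, size: int) -> int:
    """Count the 32-byte fetch windows an instruction covers (hypothetical helper)."""
    first = offset // FETCH_WINDOW
    last = (offset + size - 1) // FETCH_WINDOW
    return last - first + 1


# An 8-byte loop-header instruction starting 4 bytes before a boundary
# spans two windows, so two fetches are needed after the backward branch.
print(fetch_windows_touched(28, 8))  # 2
# The same instruction starting on a boundary fits in one window.
print(fetch_windows_touched(32, 8))  # 1
```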
This patch adds 32-byte alignment (.p2align 5, , 4) for loop headers on
GFX950 when the first real instruction is 8 bytes. At most one s_nop (4
bytes, costing 1 quad-cycle before the loop) is used for padding.
Because instructions are 4-byte aligned, an 8-byte instruction can only
straddle a 32-byte boundary when it starts 4 bytes before that boundary.
If more than 4 bytes of padding were needed, the instruction would not
straddle a boundary anyway, so alignment is skipped.
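The padding argument can be checked by enumeration. This Python sketch
(not part of the patch) models ".p2align 5, , 4" as "pad to the next
32-byte boundary only if that takes at most 4 bytes". The function names
are hypothetical, and instruction offsets are assumed to be multiples of
4 bytes.

```python
ALIGN = 32     # .p2align 5 -> 2**5 bytes
MAX_SKIP = 4   # third operand of .p2align: maximum padding in bytes


def padding_bytes(offset: int) -> int:
    """Padding emitted at `offset` under the max-skip rule (hypothetical model)."""
    needed = -offset % ALIGN
    return needed if needed <= MAX_SKIP else 0


def straddles(offset: int, size: int = 8) -> bool:
    """True if the instruction crosses a 32-byte boundary."""
    return offset // ALIGN != (offset + size - 1) // ALIGN


# Enumerate every 4-byte-aligned position within one window.
for off in range(0, ALIGN, 4):
    # Padding is emitted exactly when the 8-byte instruction would straddle.
    assert (padding_bytes(off) > 0) == straddles(off)
    # After padding, the instruction never straddles.
    assert not straddles(off + padding_bytes(off))

padded = [off for off in range(0, ALIGN, 4) if padding_bytes(off) > 0]
print(padded)  # [28]
```

Under these assumptions the only padded position is the one 4 bytes
before a boundary, which is also the only position where an 8-byte
instruction straddles.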
Note: the alignment decision is made during block-placement, before
si-insert-waitcnts. In loops where a 4-byte S_WAITCNT is later inserted
as the first instruction, the alignment becomes redundant but mostly
harmless (at most one extra s_nop per affected loop).
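A short Python sketch of why the alignment is redundant in that case
(illustrative only, not part of the patch; the header offset of 28 is a
hypothetical worst case, and 4-byte-aligned instruction offsets are
assumed):

```python
ALIGN = 32


def straddles(offset: int, size: int) -> bool:
    """True if the instruction crosses a 32-byte boundary."""
    return offset // ALIGN != (offset + size - 1) // ALIGN


# A 4-byte instruction at a 4-byte-aligned offset never crosses a boundary.
assert not any(straddles(off, 4) for off in range(0, ALIGN, 4))

header = 28                  # hypothetical offset where alignment was applied
unaligned_first = header     # S_WAITCNT position without the s_nop
aligned_first = header + 4   # S_WAITCNT position after one s_nop

# Neither placement of the 4-byte S_WAITCNT straddles a boundary.
print(straddles(unaligned_first, 4), straddles(aligned_first, 4))  # False False
# The only cost of the redundant alignment is the 4-byte s_nop.
print(aligned_first - unaligned_first)  # 4
```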
Assisted-by: Claude (Anthropic)