[AMDGPU][SIInsertWaitcnt] Implement Waitcnt Expansion for Profiling (#169345)
Reference issue: https://github.com/ROCm/llvm-project/issues/67
This patch adds support for expanding s_waitcnt instructions into
sequences with decreasing counter values, enabling PC-sampling profilers
to identify which specific memory operation is causing a stall.
This is controlled via:
Clang flag: -mamdgpu-expand-waitcnt-profiling /
-mno-amdgpu-expand-waitcnt-profiling
Function attribute: "amdgpu-expand-waitcnt-profiling"
When enabled, instead of emitting a single waitcnt, the pass generates a
sequence that waits for each outstanding operation individually. For
example, if there are 5 outstanding memory operations and the target is
to wait until 2 remain:
**Original**:
s_waitcnt vmcnt(2)
**Expanded**:
s_waitcnt vmcnt(4)
s_waitcnt vmcnt(3)
s_waitcnt vmcnt(2)
The expansion starts from (Outstanding - 1) down to the target value,
since waitcnt(Outstanding) would be a no-op (the counter is already at
that value).
- Uses ScoreBrackets to determine the actual number of outstanding
operations
- Only expands when operations complete in-order
- Skips expansion for mixed event types (e.g., LDS+SMEM on same counter)
- Skips expansion for scalar memory (always out-of-order)
Releated previous work for Reference
- **PR**: llvm/llvm-project#79236 (related `-amdgpu-waitcnt-forcezero`)
---------
Co-authored-by: Pierre van Houtryve <pierre.vanhoutryve@amd.com>