[AMDGPU]: Rewrite mbcnt_lo/mbcnt_hi to work item ID where applicable (#160496)
This PR rewrites `llvm.amdgcn.mbcnt.lo` and `llvm.amdgcn.mbcnt.hi`
intrinsic patterns into simpler `workitem.id.x` operations when the
workgroup size is known at compile time, **improving the performance of
lane ID calculations**.
**visitMbcntLo:**
- Simple replacement: When `workgroup_size == wave_size` → Replace with
`workitem.id.x`
- Bitmask optimization: When the workgroup size is a multiple of the wave
size (the workgroup splits evenly into waves) → Replace with
`workitem.id.x & (wave_size - 1)`
**visitMbcntHi:**
- Copy optimization: On wave32, `mbcnt.hi(mask, val)` → `val` (no lanes
above 31 exist, so the count over the high mask is always 0)
- Full pattern optimization: `mbcnt.hi(~0, mbcnt.lo(~0, 0))` →
`workitem.id.x`
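Both `visitMbcntHi` cases follow from the counting semantics of the intrinsic pair. The sketch below is a hand-written model (not an LLVM API): `mbcnt_hi` counts set bits of the high mask (lanes 32..63) strictly below the current lane, so on wave64 the full pattern reconstructs the lane ID, and on wave32 it contributes nothing.

```python
def popcount(x: int) -> int:
    return bin(x).count("1")

def mbcnt_lo(mask: int, base: int, lane: int) -> int:
    # set bits of `mask` (lanes 0..31) strictly below `lane`
    return base + popcount(mask & ((1 << min(lane, 32)) - 1))

def mbcnt_hi(mask: int, base: int, lane: int) -> int:
    # set bits of `mask` (lanes 32..63) strictly below `lane`
    return base + popcount(mask & ((1 << max(lane - 32, 0)) - 1))

FULL = 0xFFFFFFFF  # i32 -1

# wave64: mbcnt.hi(~0, mbcnt.lo(~0, 0)) == lane ID, which equals
# workitem.id.x when the workgroup fits in a single wave.
for lane in range(64):
    assert mbcnt_hi(FULL, mbcnt_lo(FULL, 0, lane), lane) == lane

# wave32: every lane is < 32, so mbcnt.hi(mask, val) is just val.
for lane in range(32):
    assert mbcnt_hi(FULL, 123, lane) == 123
```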
**Example 1: Simple Replacement**
```llvm
; Before:
%a = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0) ; workgroup_size = wave_size = 32
; After:
%a = call i32 @llvm.amdgcn.workitem.id.x()
```
**Example 2: Bitmask Optimization**
```llvm
; Before:
%a = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0) ; workgroup_size = 64, wave_size = 32 (two waves)
; After:
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%a = and i32 %tid, 31
```
**Example 3: Copy Optimization**
```llvm
; Before:
%a = call i32 @llvm.amdgcn.mbcnt.hi(i32 -1, i32 %val) ; wave_size = 32 (wave32 target)
ret i32 %a
; After:
ret i32 %val
```
**Example 4: Full Pattern Optimization**
```llvm
; Before:
%a = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
%b = call i32 @llvm.amdgcn.mbcnt.hi(i32 -1, i32 %a) ; workgroup_size = wave_size = 64
; After:
%b = call i32 @llvm.amdgcn.workitem.id.x()
; Before:
%a = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
%b = call i32 @llvm.amdgcn.mbcnt.hi(i32 -1, i32 %a) ; workgroup_size = 48, wave_size = 32 (partial last wave)
; After:
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%b = and i32 %tid, 31
```