llvm-project
b595849e - [AMDGPU]: Rewrite mbcnt_lo/mbcnt_hi to work item ID where applicable (#160496)

Commit
86 days ago
[AMDGPU]: Rewrite mbcnt_lo/mbcnt_hi to work item ID where applicable (#160496) This PR aims to optimize `llvm.amdgcn.mbcnt.lo` and `llvm.amdgcn.mbcnt.hi` intrinsic patterns into simpler `workitem.id.x` operations when work group sizes are known at compile time for **improving performance of lane ID calculations**. **visitMbcntLo:** - Simple replacement: When `workgroup_size == wave_size` → Replace with `workitem.id.x` - Bitmask optimization: When work group evenly splits into waves → Replace with `workitem.id.x & (wave_size - 1)` **visitMbcntHi:** - Copy optimization: On wave32, `mbcnt.hi(mask, val)` → `val` (upper 32 bits are always 0) - Full pattern optimization: `mbcnt.hi(~0, mbcnt.lo(~0, 0))` → Replace with `workitem.id.x` **Example 1: Simple Replacement** ```llvm ; Before: %a = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0); With workgroup_size = 32 ; After: %a = call i32 @llvm.amdgcn.workitem.id.x() ``` **Example 2: Bitmask Optimization** ```llvm ; Before: %a = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0); With workgroup_size = 64 (2 waves) ; After: %tid = call i32 @llvm.amdgcn.workitem.id.x() %a = and i32 %tid, 31 ``` **Example 3: Copy Optimization** ```llvm ; Before: %a = call i32 @llvm.amdgcn.mbcnt.hi(i32 -1, i32 %val); With workgroup_size = 32 ret i32 %a ; After: ret i32 %val ``` **Example 4: Full Pattern Optimization** ```llvm ; Before: %a = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0) %b = call i32 @llvm.amdgcn.mbcnt.hi(i32 -1, i32 %a); With workgroup_size = 64 ; After: %b = call i32 @llvm.amdgcn.workitem.id.x() ; Before: %a = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0) %b = call i32 @llvm.amdgcn.mbcnt.hi(i32 -1, i32 %a); With workgroup_size = 48 (Partial mask) ; After: %tid = call i32 @llvm.amdgcn.workitem.id.x() %b = and i32 %tid, 31 ```
Parents
Loading