llvm-project
bda0016a - [MLIR][AMDGPU] Add amdgpu.global_transpose_load op for gfx1200+ global memory transpose loads (#195287)

Commit
11 days ago
Adds a new `amdgpu.global_transpose_load` op to the AMDGPU dialect that wraps the `global_load_tr` family of instructions introduced in RDNA4 (gfx1200+). Each thread reads a column of a matrix from global memory and receives the corresponding transposed row in its result register.

The op is kept separate from the existing `amdgpu.transpose_load` (which targets LDS via `ds_read_tr` on gfx950+) because the two variants target different GPU architecture families, have different chipset requirements, and differ in their valid (element size, num elements) combinations. In particular, the 16-bit case produces a 128-bit (8-element) result via `global_load_tr.b128` rather than the 64-bit (4-element) result from `ds_read_tr16.b64`.

The op lowers to the existing ROCDL `global.load.tr{4,6,.}.b{64,96,128}` intrinsics added for gfx1200+.

---------

Signed-off-by: Nirvedh Meshram <nirvedh@gmail.com>
Co-authored-by: Claude Sonnet 4 (1M context) <noreply@anthropic.com>
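
As a quick check on the 16-bit result widths quoted in the commit message, the arithmetic can be sketched in Python. The helper `result_bits` is a hypothetical name used only for illustration; it is not part of the AMDGPU dialect or of this patch, and only the two 16-bit cases stated in the message are shown.

```python
# Illustrative only: result_bits is a hypothetical helper, not an MLIR API.
def result_bits(element_bits: int, num_elements: int) -> int:
    """Return the width in bits of a per-thread result vector."""
    return element_bits * num_elements


# The two 16-bit cases named in the commit message.
global_16bit = result_bits(16, 8)  # global_load_tr.b128 (global memory)
lds_16bit = result_bits(16, 4)     # ds_read_tr16.b64 (LDS)

print(global_16bit, lds_16bit)
```

The differing products are the reason the message gives for the two ops not sharing a single set of valid (element size, num elements) combinations.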