[MLIR][AMDGPU] Add amdgpu.global_transpose_load op for gfx1200+ global memory transpose loads (#195287)
Adds a new `amdgpu.global_transpose_load` op to the AMDGPU dialect that
wraps the `global_load_tr` family of instructions introduced in RDNA4
(gfx1250+). Each thread reads a column of a matrix from global memory
and receives the corresponding transposed row in its result register.
The op is kept separate from the existing `amdgpu.transpose_load` (which
targets LDS via `ds_read_tr` on gfx950+) because the two variants target
different GPU architecture families, have different chipset
requirements, and differ in their valid (element size, num elements)
combinations — in particular the 16-bit case produces a 128-bit
(8-element) result via `global_load_tr.b128` rather than the 64-bit
(4-element) result from `ds_read_tr16.b64`.
Lowering to the existing ROCDL `global.load.tr{4,6,.}.b{64,96,128}`
intrinsics added for gfx1200+.
---------
Signed-off-by: Nirvedh Meshram <nirvedh@gmail.com>
Co-authored-by: Claude Sonnet 4 (1M context) <noreply@anthropic.com>