[AMDGPU] DS loop wait relaxation -- more test cases and improvements to handle them (4/4)

Add handling for same-iteration use/overwrite of DS load results:
- Track DS load destinations and detect when results are used or
  overwritten within the same iteration
- Compute FloorWaitCount for WMMAs that only use flushed loads

Add a bailout for tensor_load_to_lds and LDS DMA writes after a barrier
Add a negative test based on the profitability criteria

Assisted-by: Cursor / claude-4.5-opus-high