[AMDGPU] Extend DS loop wait optimization with flush point tracking (#175658)
Add support for prefetch patterns where some DS loads are used in the
same iteration (creating flush points) while others remain unflushed at
the backedge.
This complements the existing pure prefetch optimization (#172728) by
handling loops in which only some of the DS loads are consumed in the
iteration that issues them.
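For illustration only, here is a toy model of the loop shape described above. It is not the code in the pass: the `Event` type and `unflushedAtBackedge` helper are hypothetical names, and the model assumes DS loads complete in issue order, so waiting for one load also flushes every load issued before it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical event model for one loop iteration (not the pass's data
// structures). A Load issues DS load `id`; a Use consumes the result of
// DS load `id` in the same iteration.
struct Event {
  enum Kind { Load, Use } kind;
  int id;
};

// Returns how many DS loads issued in this iteration are still in flight
// at the backedge. Assumes in-order completion: a Use of load k is a
// flush point that retires k and every load issued before it.
int unflushedAtBackedge(const std::vector<Event> &body) {
  std::vector<int> inflight; // load ids in issue order
  for (const Event &e : body) {
    assert(e.id >= 0 && "ids are non-negative in this sketch");
    if (e.kind == Event::Load) {
      inflight.push_back(e.id);
      continue;
    }
    // Use: if the load is still in flight, flush it and all older loads.
    for (std::size_t i = 0; i < inflight.size(); ++i) {
      if (inflight[i] == e.id) {
        inflight.erase(inflight.begin(),
                       inflight.begin() + static_cast<std::ptrdiff_t>(i) + 1);
        break;
      }
    }
  }
  return static_cast<int>(inflight.size());
}
```

In this model a body of `Load 0, Load 1, Use 0, Load 2, Load 3` leaves three loads in flight at the backedge (1, 2 and 3), whereas a pure prefetch body with no same-iteration use leaves every load in flight.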
Assisted-by: Cursor / claude-4.5-opus-high