[AMDGPU] Enable runtime loop unrolling (#194924)
Enable auto runtime unrolling for AMDGPU by setting `UP.Runtime = true`
in `getUnrollingPreferences`, with `PartialThreshold = Threshold / 4` to
limit code-size growth.
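
For illustration only, the two settings can be sketched outside of LLVM as below. `UnrollPrefsSketch` and `applyRuntimeUnrollPrefs` are hypothetical stand-ins, not the actual patch; only the field names `Threshold`, `PartialThreshold`, and `Runtime` mirror `TargetTransformInfo::UnrollingPreferences`, and the default values are placeholders chosen for the example.

```cpp
#include <cassert>

// Hypothetical stand-in for the few UnrollingPreferences fields this
// change touches. The defaults are placeholders, not AMDGPU's values.
struct UnrollPrefsSketch {
  unsigned Threshold = 300;        // full-unroll cost budget (assumed value)
  unsigned PartialThreshold = 150; // partial/runtime unroll cost budget
  bool Runtime = false;            // allow unrolling with a runtime trip count
};

// Sketch of the adjustment described above: enable runtime unrolling and
// cap the partial-unroll budget at a quarter of the full threshold so
// that unrolled loop bodies stay small.
inline void applyRuntimeUnrollPrefs(UnrollPrefsSketch &UP) {
  UP.Runtime = true;
  UP.PartialThreshold = UP.Threshold / 4;
}

// Small usage helper: returns the partial budget derived from a threshold.
inline unsigned partialBudgetFor(unsigned Threshold) {
  UnrollPrefsSketch UP;
  UP.Threshold = Threshold;
  applyRuntimeUnrollPrefs(UP);
  assert(UP.Runtime && "runtime unrolling should be enabled");
  return UP.PartialThreshold;
}
```

With the placeholder threshold of 300, the partial budget comes out to 75. The division is integer division, so thresholds that are not multiples of 4 round down.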
Benchmarked on **MI350X (gfx950)** and **MI300X (gfx942)** using
Composable Kernel, xpu-perf, and llama.cpp. Results showed some
improvements and no significant regressions.
AI Disclaimer: Cursor was used to evaluate the change and run
benchmarking experiments.