Avoid Triton 256MB autotune cache allocation (#184479)
Summary:
Use an Inductor-owned CUDA benchmarking path with an L2-sized cache buffer whenever Triton's do_bench would otherwise allocate its fixed 256MB cache-clearing tensor. Non-CUDA/HIP devices and unsupported benchmark options keep the existing fallback behavior. Add focused regression tests for the cache size and active-device handling.
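For illustration only, the sizing decision can be sketched as below. This is not the code from the PR: the helper name `cache_clear_buffer_bytes` and the rule of falling back to the fixed size when the L2 size is unknown are assumptions made for this sketch.

```python
from typing import Optional

# Size of the fixed cache-clearing tensor described in the summary (256MB).
FIXED_CACHE_BYTES = 256 * 1024 * 1024


def cache_clear_buffer_bytes(l2_cache_bytes: Optional[int]) -> int:
    """Pick the size of the cache-flushing buffer (hypothetical helper).

    Assumption: when the device reports a positive L2 cache size, a buffer
    of that size is enough to evict cached data between benchmark runs.
    When the size is unknown, keep the fixed 256MB size.
    """
    if l2_cache_bytes is None or l2_cache_bytes <= 0:
        return FIXED_CACHE_BYTES
    return l2_cache_bytes


if __name__ == "__main__":
    # Example: a device reporting a 40MB L2 cache (illustrative value).
    print(cache_clear_buffer_bytes(40 * 1024 * 1024))
    # Example: L2 size unavailable, so the fixed size is kept.
    print(cache_clear_buffer_bytes(None))
```

On a CUDA device the L2 size would typically come from the device properties, so the buffer is far smaller than 256MB on most GPUs.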
Fixes #93509
Generated by my agent
X-link: https://github.com/pytorch/pytorch/pull/184479
Approved by: https://github.com/eellison
Reviewed By: atalman
Differential Revision: D107442346
fbshipit-source-id: 80db6e232d8ae3bf27c8752f16af434d7dc59440