benchmark
d0d0d40c - Avoid Triton 256MB autotune cache allocation (#184479)

Commit

24 days ago

Avoid Triton 256MB autotune cache allocation (#184479)

Summary: Use an Inductor-owned CUDA benchmarking path with an L2-sized cache buffer when Triton do_bench would otherwise allocate its fixed 256MB cache-clearing tensor, while preserving fallback behavior for non-CUDA/HIP and unsupported benchmark options. Add focused regression coverage for the cache size and active-device handling.

Fixes #93509

Generated by my agent

X-link: https://github.com/pytorch/pytorch/pull/184479
Approved by: https://github.com/eellison

Reviewed By: atalman

Differential Revision: D107442346

fbshipit-source-id: 80db6e232d8ae3bf27c8752f16af434d7dc59440
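The sizing idea in the summary can be illustrated with a minimal sketch. This is not the code from the commit: the function name `cache_clear_buffer_bytes` and the constant `TRITON_FIXED_CACHE_BYTES` are hypothetical, and the fallback to the fixed size is only a stand-in for the "preserving fallback behavior" clause. In a real PyTorch session the L2 size would probably come from `torch.cuda.get_device_properties(device).L2_cache_size`, but the sketch takes it as a plain argument so that it runs without a GPU.

```python
from typing import Optional

# The fixed 256MB cache-clearing size named in the commit message.
TRITON_FIXED_CACHE_BYTES = 256 * 1024 * 1024


def cache_clear_buffer_bytes(l2_cache_bytes: Optional[int]) -> int:
    """Pick the size of the buffer used to flush L2 between timed runs.

    Hypothetical helper: when the device reports an L2 size, a buffer of
    that size is assumed to be enough; otherwise keep the fixed size.
    """
    if l2_cache_bytes is None or l2_cache_bytes <= 0:
        # Unknown device or unsupported query: keep the old behavior.
        return TRITON_FIXED_CACHE_BYTES
    return l2_cache_bytes


if __name__ == "__main__":
    # Example: a device reporting a 40 MiB L2 cache (illustrative value).
    print(cache_clear_buffer_bytes(40 * 1024 * 1024))
    print(cache_clear_buffer_bytes(None))
```

Under these assumptions, a device that reports its L2 size gets a buffer of that size rather than 256MB, which is the allocation the commit title says it avoids.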

Author

jansel

Committer

meta-codesync[bot]

Parents

c708a21d
