[inductor] avoid kernel cache miss because of different arg name (#97755)
We previously used the buffer name as the name of the variable holding the randomly generated kernel input in the kernel benchmark. This has a big drawback: the same kernel may be used for different buffers, but if the buffer name is used as the argument name, the kernel source code differs between invocations of the same kernel. This has the following downsides:
- compile time is longer, since a compiled kernel cannot be reused due to the cache miss.
- behavior is inconsistent depending on whether TORCHINDUCTOR_BENCHMARK_KERNEL is enabled. With it enabled, we may see more kernels (some of them essentially duplicates) in the compiled module.
- this obscures some optimization opportunities. E.g., a kernel that takes 6% of the time is worth looking at. But if the kernel is called 20 times and shows up as 20 different kernels, each taking 0.3% of the time, it is less obvious that we should optimize it.
In this PR, we use canonical names like `arg_{i}` rather than buffer names, which avoids all the issues above.
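
As a rough illustration of why the naming matters, here is a minimal sketch that assumes the kernel cache is keyed on a hash of the generated source text. `render_kernel_source`, `cache_key`, and `count_unique_kernels` are hypothetical helpers written for this example, not inductor APIs, and the rendered source is only a stand-in for the real kernel plus benchmark harness.

```python
import hashlib


def render_kernel_source(arg_names):
    # Hypothetical stand-in for a generated kernel and its benchmark harness.
    # The argument names are embedded in the source text, as in the old scheme.
    params = ", ".join(arg_names)
    return (
        f"def kernel({params}):\n"
        f"    pass\n"
        f"\n"
        f"def benchmark():\n"
        f"    kernel({params})\n"
    )


def cache_key(source):
    # Assumption: the cache is keyed on a hash of the source string.
    return hashlib.sha256(source.encode()).hexdigest()


def count_unique_kernels(invocations, canonical):
    # Count distinct cache keys across invocations of the *same* kernel.
    keys = set()
    for buffer_names in invocations:
        if canonical:
            # Canonical names depend only on argument position.
            names = [f"arg_{i}" for i in range(len(buffer_names))]
        else:
            # Old behavior: names follow the buffers being passed in.
            names = list(buffer_names)
        keys.add(cache_key(render_kernel_source(names)))
    return len(keys)


# The same kernel invoked on three different pairs of buffers.
invocations = [("buf0", "buf1"), ("buf2", "buf3"), ("buf4", "buf5")]
unique_with_buffer_names = count_unique_kernels(invocations, canonical=False)
unique_with_canonical_names = count_unique_kernels(invocations, canonical=True)
```

Under this assumption, buffer-derived names yield one cache entry per invocation, while canonical names collapse all three invocations into a single entry.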
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97755
Approved by: https://github.com/jansel