[inductor] Use 64-bit indexing for large tensors in Triton code (#97447)
Add an `index_dtype` property to `TritonKernel`, which is used as the
dtype in indexing calculations. It defaults to `tl.int32`, but if any
input or output buffer is larger than `INT_MAX`, `tl.int64` is used
instead.
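
The selection rule described above can be sketched roughly as follows. This is
an illustration only, not the code in this PR: `select_index_dtype` and
`buffer_numels` are hypothetical names, buffer size is assumed to mean element
count, `INT_MAX` is assumed to be the signed 32-bit maximum, and the dtypes are
returned as strings so the snippet runs without Triton installed.

```python
# Assumption: INT_MAX is the largest signed 32-bit integer.
INT_MAX = 2**31 - 1


def select_index_dtype(buffer_numels):
    """Pick the dtype name to use for index arithmetic in a kernel.

    `buffer_numels` is a hypothetical iterable holding one size per input
    or output buffer (assumed here to be the number of elements).
    """
    # A single oversized buffer is enough to require 64-bit indexing.
    if any(numel > INT_MAX for numel in buffer_numels):
        return "tl.int64"
    # Otherwise stay on the 32-bit default.
    return "tl.int32"


small = select_index_dtype([1024, 4096])
large = select_index_dtype([1024, 2**31])
```

Here `small` stays on the 32-bit default, while `large` switches to 64-bit
because one buffer exceeds `INT_MAX`.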
Should fix #96978 and #93606 (still needs to be double-checked).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97447
Approved by: https://github.com/ngimel