Express y grid > 2^16 in terms of z grid (#121554)
CUDA has a max y_grid of 65535. If we're computing larger than that we can compose it in terms of z grid, which is currently unused in inductor codegen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121554
Approved by: https://github.com/aakhundov