Speed up LossCTC.cu (#97269)
For these two kernels, launching with `grid.x == 1` is sufficient. With `grid.x > 1`, the additional blocks along x repeat the same computation.
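A rough host-side illustration of the effect, not the actual kernel code: it assumes a kernel that derives its indices only from `threadIdx.x`, `threadIdx.y` and `blockIdx.y` and never reads `blockIdx.x`. `Dim3`, `emulate_launch` and `all_equal` are hypothetical names used only for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for CUDA's dim3, restricted to x and y.
struct Dim3 { unsigned x, y; };

// Emulates a 2D launch on the host and counts how often each output
// element is computed. The "kernel body" is an assumption: it picks the
// batch row from blockIdx.y/threadIdx.y and strides over the row with
// threadIdx.x only, so blockIdx.x (bx) never influences the indices.
std::vector<int> emulate_launch(Dim3 grid, Dim3 block,
                                unsigned batch_size, unsigned row_len) {
    std::vector<int> writes(static_cast<std::size_t>(batch_size) * row_len, 0);
    for (unsigned by = 0; by < grid.y; ++by)
        for (unsigned bx = 0; bx < grid.x; ++bx) {
            (void)bx;  // deliberately unused, as in the assumed kernel
            for (unsigned ty = 0; ty < block.y; ++ty)
                for (unsigned tx = 0; tx < block.x; ++tx) {
                    unsigned b = by * block.y + ty;   // batch index
                    if (b >= batch_size) continue;    // out-of-range guard
                    // Block-stride loop over the row using threadIdx.x only.
                    for (unsigned s = tx; s < row_len; s += block.x)
                        ++writes[static_cast<std::size_t>(b) * row_len + s];
                }
        }
    return writes;
}

// True if every element of v equals expected.
bool all_equal(const std::vector<int>& v, int expected) {
    for (int x : v)
        if (x != expected) return false;
    return true;
}
```

Under these assumptions, `emulate_launch({1, 2}, {4, 2}, 3, 10)` touches every element once, while `emulate_launch({3, 2}, {4, 2}, 3, 10)` touches every element three times, once per block along x.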
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97269
Approved by: https://github.com/ngimel, https://github.com/malfet