[ATen][Native][CUDA] Decrease max_threads in ctc_loss (#120746)
CUDA 12.4 will introduce changes that require a smaller number of threads per block for double precision in `ctc_loss`. This PR lowers `max_threads` accordingly.
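
As a rough illustration of the idea, the per-block thread cap can be selected at compile time from the scalar type. This is a minimal sketch, not the code touched by this PR: the helper names `max_threads_for` and `clamp_block_size` are hypothetical, and the values 1024 and 768 are assumed for illustration only.

```cpp
#include <type_traits>

// Hypothetical helper (not the actual ATen code): pick a per-block thread cap
// from the scalar type. The numbers below are illustrative assumptions.
template <typename scalar_t>
constexpr int max_threads_for() {
  // Double-precision kernels tend to need more registers per thread,
  // so fewer threads can be resident in a single block.
  return std::is_same<scalar_t, float>::value ? 1024 : 768;
}

// Hypothetical helper: clamp a requested block size to the cap for scalar_t.
template <typename scalar_t>
constexpr int clamp_block_size(int requested) {
  return requested < max_threads_for<scalar_t>() ? requested
                                                 : max_threads_for<scalar_t>();
}
```

With these assumed values, `clamp_block_size<double>(1024)` yields the lower double-precision cap, while `clamp_block_size<float>(1024)` is left unchanged.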
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120746
Approved by: https://github.com/ptrblck, https://github.com/janeyx99