Add bfloat16 support for kl_div_backward_cuda (#77676)
This PR adds the feature requested in issue #77375:
`kl_div_backward_cuda` now supports `bfloat16`.
cc @ngimel @ptrblck @rosrad
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77676
Approved by: https://github.com/jbschlosser