EmbeddingBag sort thrust->cub (#64498)
Summary:
Partially fixes https://github.com/pytorch/pytorch/issues/57505
Also fixes a warning I found when compiling:
```
/home/gaoxiang/pytorch-cub/torch/csrc/distributed/c10d/quantization/quantization_gpu.cu(7): warning: inline qualifier ignored for "__global__" function
```
I also updated the bfloat16 guard to CUDA 11.5.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64498
Reviewed By: mruberry
Differential Revision: D30917077
Pulled By: ngimel
fbshipit-source-id: fb9df08fd469038478a563014b5af7452b4b28c0