Add dot implementation for BFloat16 on CUDA (#57903)
Summary:
Enabled `dot` for BFloat16 on CUDA (version 11+).
This also enables `matmul` and `vdot` for BFloat16.
The backward pass of `matmul` is not yet supported for BFloat16.
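A minimal usage sketch of the newly enabled ops, assuming a PyTorch build with CUDA 11+ and a BFloat16-capable GPU (the guard skips the calls otherwise):

```python
import torch

# Sketch only: requires a CUDA 11+ PyTorch build; the ops below were
# previously unsupported for torch.bfloat16 on CUDA.
if torch.cuda.is_available():
    a = torch.randn(4, dtype=torch.bfloat16, device="cuda")
    b = torch.randn(4, dtype=torch.bfloat16, device="cuda")

    # 1-D dot product; result is a 0-dim bfloat16 tensor.
    d = torch.dot(a, b)

    # vdot matches dot for real-valued inputs.
    v = torch.vdot(a, b)

    # matmul with bfloat16 operands (forward only; backward is not
    # yet supported for BFloat16).
    m = torch.randn(3, 4, dtype=torch.bfloat16, device="cuda")
    out = torch.matmul(m, a)
```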
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57903
Reviewed By: mruberry
Differential Revision: D28346031
Pulled By: ngimel
fbshipit-source-id: 0917e9e0d6cf3694f45fe1c7e76370581502036a