Enabled bfloat16 for CUDA (#27259)
Summary:
Enabled basic support for bfloat16 on CUDA.
Tested via unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27259
Differential Revision: D17728661
Pulled By: izdeby
fbshipit-source-id: 99efb6bc4aec029fe6bbc8a68963dca9c9dc5810