[fuser] Support bfloat16 (#54571)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54571
Adds bfloat16 support to the fuser using the same approach as half:
upconvert inputs to fp32, do the math in fp32, then downconvert outputs
to bf16.
The resource strings are mostly derived from the CUDA 11 headers.
Fixes #53918, for the legacy fuser at least.
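
For illustration only, the upconvert / compute / downconvert pattern
described above can be sketched on the host in plain Python. This is not
the fuser's generated kernel code: the function names are hypothetical,
the round-to-nearest-even downconversion is an assumption about the
rounding mode, and Python doubles stand in for fp32 arithmetic.

```python
import struct


def bf16_bits_to_float(bits: int) -> float:
    """Upconvert: a bf16 value is the top 16 bits of an IEEE-754 fp32 value."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def float_to_bf16_bits(value: float) -> int:
    """Downconvert to bf16 bits, rounding to nearest even (assumed mode)."""
    # Packing as "<f" first rounds the Python double to fp32.
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    if (bits & 0x7FFFFFFF) > 0x7F800000:
        # NaN: truncate and force a mantissa bit so it stays a NaN.
        return (bits >> 16) | 0x0040
    # Add 0x7FFF plus the lowest kept bit so that ties round to even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def fused_mul_add_bf16(a_bits: int, b_bits: int, c_bits: int) -> int:
    """Hypothetical fused op: upconvert inputs, compute, downconvert output."""
    a = bf16_bits_to_float(a_bits)
    b = bf16_bits_to_float(b_bits)
    c = bf16_bits_to_float(c_bits)
    result = a * b + c  # intermediate math stays in the wider type
    return float_to_bf16_bits(result)


if __name__ == "__main__":
    one, two = 0x3F80, 0x4000  # bf16 bit patterns for 1.0 and 2.0
    out = fused_mul_add_bf16(one, two, one)
    print(hex(out), bf16_bits_to_float(out))
```

Only the inputs and the final output are in bf16; the intermediate
product and sum are never rounded to bf16, which is the point of doing
the math in the wider type.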
Test Plan: Imported from OSS
Reviewed By: ailzhang
Differential Revision: D27328987
Pulled By: bertmaher
fbshipit-source-id: 5c0eae44164623faa0c75cb818e8bf0211579fdc