avoid explicitly casting low precision inputs to fp32 in norm (#59134)
Summary:
Per the title: `norm` with fp16/bfloat16 inputs and an fp32 output dtype on CUDA no longer performs an explicit cast of the input to fp32, which avoids materializing a full-precision copy of the input tensor.
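
A minimal sketch of the affected call pattern (assumes a CUDA device is available; tensor shapes and names are illustrative):

```python
import torch

# Low-precision input on CUDA.
x = torch.randn(1024, device="cuda", dtype=torch.float16)

# Requesting an fp32 result from a fp16 input. With this change, the
# input is no longer explicitly cast to fp32 beforehand; the norm kernel
# consumes the low-precision input directly.
out = torch.norm(x, p=2, dtype=torch.float32)
print(out.dtype)  # torch.float32
```

The same applies to `torch.bfloat16` inputs with `dtype=torch.float32`.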
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59134
Reviewed By: mruberry
Differential Revision: D28775729
Pulled By: ngimel
fbshipit-source-id: 896daa4f02e8a817cb7cb99ae8a93c02fa8dd5e9