Invoke the bf16 load without an element count to avoid allocating a temporary buffer, which improves performance. (#99822)
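
For illustration only, a minimal sketch of why a counted load can cost more than a full-width one. `ToyBF16Vec`, `kSize`, and both `loadu` overloads are hypothetical stand-ins written for this note; they are not the actual PyTorch vector class or the code touched by this change, and the real implementation may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical 16-lane bf16 vector; lanes are stored as raw 16-bit patterns.
struct ToyBF16Vec {
    static constexpr int kSize = 16;
    uint16_t lanes[kSize];

    // Full-width load: reads kSize elements directly from the source.
    static ToyBF16Vec loadu(const void* ptr) {
        ToyBF16Vec v;
        std::memcpy(v.lanes, ptr, sizeof(v.lanes));
        return v;
    }

    // Counted load: stages the data in a zero-filled temporary first,
    // so it pays for an extra buffer and an extra copy even when
    // count == kSize.
    static ToyBF16Vec loadu(const void* ptr, int count) {
        assert(count >= 0 && count <= kSize);
        uint16_t tmp[kSize] = {0};
        std::memcpy(tmp, ptr, static_cast<size_t>(count) * sizeof(uint16_t));
        return loadu(tmp);
    }
};
```

In this sketch, when the caller already knows a full vector's worth of elements is available, the one-argument overload yields the same lanes as the counted one while skipping the temporary.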
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99822
Approved by: https://github.com/jgong5, https://github.com/jansel