Optimize sparse.mm reduce in BFloat16 data type in CPU backend (#103239)
### Description
This PR optimizes the `reduce` variant of `sparse.mm` for the BFloat16 data type on the CPU backend, which is one of the tasks tracked in https://github.com/pyg-team/pytorch_geometric/issues/7057. Half support, which requires a Half implementation of addmm, will be added once https://github.com/pytorch/pytorch/pull/99498 is upstreamed.
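For reviewers less familiar with the operation, here is a minimal pure-Python sketch of what a sparse-times-dense product with a per-row reduction computes. It is a reference for the semantics only, not the kernel in this PR: the function name `csr_mm_reduce`, the list-based CSR inputs, and the choice to leave empty rows at zero are assumptions made for illustration, and it does no BFloat16 handling or vectorization.

```python
from typing import List, Sequence


def csr_mm_reduce(
    crow: Sequence[int],
    col: Sequence[int],
    val: Sequence[float],
    dense: Sequence[Sequence[float]],
    reduce: str = "sum",
) -> List[List[float]]:
    """Reference semantics for (CSR sparse) @ dense with a row-wise reduction.

    Hypothetical helper for illustration; not part of PyTorch.
    crow/col/val are the CSR row pointers, column indices and values.
    """
    n_rows = len(crow) - 1
    n_cols = len(dense[0]) if dense else 0
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        start, end = crow[i], crow[i + 1]
        if start == end:
            # Assumption: a row with no stored entries stays zero.
            continue
        for k in range(n_cols):
            # One term per stored entry of row i.
            terms = [val[p] * dense[col[p]][k] for p in range(start, end)]
            if reduce == "sum":
                out[i][k] = sum(terms)
            elif reduce == "mean":
                out[i][k] = sum(terms) / len(terms)
            elif reduce == "amax":
                out[i][k] = max(terms)
            elif reduce == "amin":
                out[i][k] = min(terms)
            else:
                raise ValueError(f"unsupported reduce: {reduce!r}")
    return out


# Sparse matrix [[1, 0, 2], [0, 0, 0], [0, 3, 0]] in CSR form.
crow = [0, 2, 2, 3]
col = [0, 2, 1]
val = [1.0, 2.0, 3.0]
dense = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

for mode in ("sum", "mean", "amax", "amin"):
    print(mode, csr_mm_reduce(crow, col, val, dense, mode))
```

The inner loop over the stored entries of each row is the part the optimized CPU kernel has to make fast for BFloat16 inputs.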
Next steps:
- [x] Add benchmarks
- [x] Update UTs
- [x] Check backward behaviors
- [x] Refactor code
### Performance test (Updated)
BFloat16 was tested on an Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz, with jemalloc and iomp.

Single socket (40 cores)
![image](https://github.com/pytorch/pytorch/assets/61222868/509e8482-9160-4b85-bc39-5b6aad510283)
Single core
![image](https://github.com/pytorch/pytorch/assets/61222868/c953a494-8f8e-4dbd-a8a7-421d8c22e946)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103239
Approved by: https://github.com/mingfeima, https://github.com/albanD