add BFloat16 operators on CPU: diag, fmod, cumsum, cumprod (#61897)
Add BFloat16 support for searchsorted, diag, fmod, cumsum, and cumprod on CPU.
Benchmark data for these ops was collected for both the BFloat16 and Float32 data types with PyTorch's operator_benchmark tool on an Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz.
Number of cores: 1 core and 28 cores (1 socket)
[cumsum_cumprod_benchmark.txt](https://github.com/pytorch/pytorch/files/6980232/cumsum_cumprod_benchmark.txt)
[diag_benchmark.txt](https://github.com/pytorch/pytorch/files/6980233/diag_benchmark.txt)
[fmod_benchmark.txt](https://github.com/pytorch/pytorch/files/6980234/fmod_benchmark.txt)
[searchsorted_benchmark.txt](https://github.com/pytorch/pytorch/files/6980328/searchsorted_benchmark.txt)
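As a quick sanity check of the ops listed above, the following minimal sketch calls each of them on a BFloat16 CPU tensor. It assumes a PyTorch build that includes this change; the helper name `check_bfloat16_cpu_ops` and the input values are illustrative only and are not part of the PR.

```python
import torch


def check_bfloat16_cpu_ops():
    """Run the ops named in this PR on small BFloat16 CPU tensors.

    Hypothetical helper for illustration; not part of the PR itself.
    """
    # Tensors are created on CPU by default.
    x = torch.tensor([1.0, 2.0, 3.0, 4.0], dtype=torch.bfloat16)

    results = {
        "cumsum": torch.cumsum(x, dim=0),    # running sum along dim 0
        "cumprod": torch.cumprod(x, dim=0),  # running product along dim 0
        "fmod": torch.fmod(x, 3.0),          # element-wise remainder
        "diag": torch.diag(x),               # 1-D input -> 4x4 diagonal matrix
    }

    # searchsorted returns insertion indices (int64), not BFloat16 values.
    sorted_seq = torch.tensor([1.0, 3.0, 5.0, 7.0], dtype=torch.bfloat16)
    values = torch.tensor([2.0, 6.0], dtype=torch.bfloat16)
    results["searchsorted"] = torch.searchsorted(sorted_seq, values)

    return results


if __name__ == "__main__":
    for name, out in check_bfloat16_cpu_ops().items():
        print(name, out.dtype, out.tolist())
```

The inputs are small integers that BFloat16 represents exactly, so the results can be compared against Float32 without any rounding tolerance; larger or fractional inputs would generally need one.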
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61897
Approved by: https://github.com/VitalyFedyunin