Add FusedLinearCrossEntropy (#2485)
Summary:
As discussed in https://github.com/pytorch/pytorch/issues/136168, I'm migrating operator benchmark implementations. This PR adds multiple implementations of FusedLinearCrossEntropy as a starting example.
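For context on what is being benchmarked: an LM head computes logits with a linear projection and then a cross-entropy loss, and the "fused" variants avoid materializing the full (batch, vocab) logits tensor at once. Below is a minimal NumPy sketch of the unfused baseline pattern and a chunked variant that illustrates the fusion idea; the function names and chunking scheme are illustrative only and are not the actual LMHeadCE/Liger/inductor implementations measured above.

```python
import numpy as np

def lm_head_cross_entropy(hidden, weight, targets):
    """Unfused baseline: materialize the full (batch, vocab) logits,
    then compute a numerically stable softmax cross-entropy.
    hidden: (batch, dim), weight: (vocab, dim), targets: (batch,) int labels.
    """
    logits = hidden @ weight.T                   # (batch, vocab) -- memory-heavy
    logits = logits - logits.max(axis=1, keepdims=True)  # stabilize exp()
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def chunked_linear_cross_entropy(hidden, weight, targets, chunk=2):
    """Chunked variant: process the batch in slices so the full logits
    tensor is never materialized at once -- the core idea behind fused
    linear + cross-entropy kernels. Produces the same mean loss.
    """
    total = 0.0
    for i in range(0, len(hidden), chunk):
        h, t = hidden[i:i + chunk], targets[i:i + chunk]
        total += lm_head_cross_entropy(h, weight, t) * len(t)
    return total / len(hidden)
```

The two functions agree on the loss value; the payoff of the chunked/fused form is peak-memory and bandwidth savings when `vocab` is large, which is what the latency columns in the output below compare.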
Execution command:
```
python run_benchmark.py triton --op FusedLinearCrossEntropy
```
Example output:
```
  x_val    LMHeadCE-latency    LigerLMHeadCE-latency    inductor_fused_linear_cross_entropy-latency
-------  ------------------  -----------------------  ---------------------------------------------
      0             98.0041                  389.87                                         95.0412
      1            196.12                    652.619                                       193.219
      2            417.242                  1248.75                                        416.725
      3            824.906                  2356.25                                        809.56
```
Pull Request resolved: https://github.com/pytorch/benchmark/pull/2485
Reviewed By: xuzhao9
Differential Revision: D63859871
Pulled By: FindHao
fbshipit-source-id: 4b73a2144702c1f8f3ae5ed15e76112d03f12b87