optimize softmax backward and logsoftmax backward (#80114)
Currently, if softmax_backward/logsoftmax_backward is run along a dim other than the last, the calculation falls back to a [scalar version](https://github.com/pytorch/pytorch/blob/32593ef2dd26e32ed44d3c03d3f5de4a42eb149a/aten/src/ATen/native/SoftMax.cpp#L220-L287). We found that the calculation can in fact be vectorized along the inner_size dim.
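As a minimal illustration of the case in question (shapes are arbitrary):

```python
import torch

# Softmax over dim=1 of a 3-D tensor: the reduction is not along the last
# dim, so inner_size (128 here) > 1 and the backward pass previously took
# the scalar host_softmax_backward path.
x = torch.randn(32, 64, 128, requires_grad=True)
y = torch.softmax(x, dim=1)
y.sum().backward()  # dispatches softmax backward along a non-last dim
```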
Changes we made:
Use the vectorized softmax_backward_kernel/log_softmax_backward_kernel instead of host_softmax_backward when the reduction is not along the last dim; a sketch of the underlying math follows below.
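For reference, the backward formulas being vectorized can be sketched functionally in Python (a sketch of the math only, not the ATen kernel; shapes are illustrative):

```python
import torch

def softmax_backward(grad_output, output, dim):
    # dX = Y * (dY - sum(dY * Y, dim)); the sum broadcasts over inner_size,
    # so the loop over inner_size vectorizes naturally.
    return output * (grad_output - (grad_output * output).sum(dim, keepdim=True))

def log_softmax_backward(grad_output, output, dim):
    # dX = dY - exp(Y) * sum(dY, dim)
    return grad_output - output.exp() * grad_output.sum(dim, keepdim=True)

# Sanity check against autograd along a non-last dim.
x = torch.randn(4, 8, 16, dtype=torch.double, requires_grad=True)
y = torch.softmax(x, dim=1)
g = torch.randn_like(y)
y.backward(g)
assert torch.allclose(x.grad, softmax_backward(g, y.detach(), dim=1))
```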
We collected benchmark data for softmax_backward and logsoftmax_backward with the BFloat16 and Float32 data types using PyTorch's operator_benchmark tool on an Intel(R) Xeon(R) Platinum 8260L CPU @ 2.40GHz.
Number of cores: 24 cores (1 socket)
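The logs below come from the operator_benchmark suite; a simpler standalone timing sketch using torch.utils.benchmark (not the exact configs behind the numbers in the logs) would look like:

```python
import torch
from torch.utils import benchmark

# Time softmax backward along a non-last dim; shape and dtype are illustrative.
x = torch.randn(32, 64, 128, dtype=torch.bfloat16, requires_grad=True)
y = torch.softmax(x, dim=1)
grad = torch.randn_like(y)

t = benchmark.Timer(
    stmt="y.backward(grad, retain_graph=True)",
    globals={"y": y, "grad": grad},
    num_threads=24,  # matches the 24-core single-socket setup above
)
print(t.blocked_autorange())
```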
[softmax_benchmark_32593ef.log](https://github.com/pytorch/pytorch/files/8962956/softmax_benchmark_32593ef.log)
[softmax_benchmark_the_pr.log](https://github.com/pytorch/pytorch/files/8962958/softmax_benchmark_the_pr.log)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80114
Approved by: https://github.com/frank-wei