Vectorize the softmax calculation when not along the last dim (#59195)
Summary:
Currently, when softmax is computed along a dim other than the last one, the calculation falls back to a [scalar version](https://github.com/pytorch/pytorch/blob/d417a094f398f1c4efd7f818b14b8471a597fbcc/aten/src/ATen/native/SoftMax.cpp#L14-L64). We found that this case can instead be vectorized along the inner_size dim.
Changes we made:
- Use the vectorized softmax_kernel instead of host_softmax when softmax is not along the last dim.
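To illustrate why the inner_size dim is a good vectorization target, here is a minimal standalone sketch. It is not the ATen kernel from this PR: the function name `softmax_middle_dim`, the raw-pointer interface, and the per-position `max_buf`/`sum_buf` scratch buffers are all assumptions made for illustration. It treats the input as a contiguous `[outer_size, dim_size, inner_size]` float buffer and keeps every innermost loop unit-stride over inner_size, which is the access pattern that SIMD loads and stores need.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper, not the ATen implementation: softmax over the middle
// dim of a contiguous [outer_size, dim_size, inner_size] float buffer.
void softmax_middle_dim(const float* in, float* out, std::size_t outer_size,
                        std::size_t dim_size, std::size_t inner_size) {
  assert(dim_size > 0);
  // One running max and one running sum per inner position.
  std::vector<float> max_buf(inner_size);
  std::vector<float> sum_buf(inner_size);

  for (std::size_t o = 0; o < outer_size; ++o) {
    const float* src = in + o * dim_size * inner_size;
    float* dst = out + o * dim_size * inner_size;
    std::fill(max_buf.begin(), max_buf.end(),
              -std::numeric_limits<float>::infinity());
    std::fill(sum_buf.begin(), sum_buf.end(), 0.0f);

    // Pass 1: elementwise max across the softmax dim (unit stride over i).
    for (std::size_t d = 0; d < dim_size; ++d) {
      for (std::size_t i = 0; i < inner_size; ++i) {
        max_buf[i] = std::max(max_buf[i], src[d * inner_size + i]);
      }
    }
    // Pass 2: exp(x - max), accumulated into the per-position sums.
    for (std::size_t d = 0; d < dim_size; ++d) {
      for (std::size_t i = 0; i < inner_size; ++i) {
        const float e = std::exp(src[d * inner_size + i] - max_buf[i]);
        dst[d * inner_size + i] = e;
        sum_buf[i] += e;
      }
    }
    // Pass 3: normalize by the per-position sums.
    for (std::size_t d = 0; d < dim_size; ++d) {
      for (std::size_t i = 0; i < inner_size; ++i) {
        dst[d * inner_size + i] /= sum_buf[i];
      }
    }
  }
}
```

In this sketch the vectorization is left to the compiler. The design point is only that each reduction step works on whole rows of inner_size contiguous elements rather than striding through memory one scalar at a time.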
Performance data on a 28-core Intel 8280 CPU with an input of size [32, 81, 15130] and softmax along the second dim (81):
- FP32 baseline: 24.67 ms
- FP32 optimized: 9.2 ms
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59195
Reviewed By: ailzhang
Differential Revision: D28854796
Pulled By: cpuhrsch
fbshipit-source-id: 18477acc3963754c59009b1794f080496ae16c3d