Softmax interface update (#12469)
* Template datatype for SoftmaxWithRawMaskSmallKernel in ROCm EP
* Remove valid_items usage from SoftmaxWithRawMaskSmallKernel for ROCm EP
The kernel already masks off invalid items, so valid_items is redundant;
dropping it gives a much faster implementation in hipCUB.
* Update accumulator type in ROCm EP for SoftmaxWithRawMaskSmallKernel
Hard-code the accumulator to fp32 for hipCUB in this kernel.
* Reset casting to old behavior
* Document steps to optimize SoftMax kernel on ROCm EP
Using the hipCUB valid_items interface on reduction operations
has a significant performance cost. Mask all thread data instead,
so that the valid_items interface to hipCUB is not needed.
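
The masking idea can be sketched on the host in plain C++. This is not the
ROCm kernel itself: MaskedSoftmax, thread_data and the loop-based
"reductions" are hypothetical stand-ins for one thread block, assuming each
thread holds one value and the block reduces with max and then sum. Padding
threads are set to -inf, which is the identity for max and turns into 0 (the
identity for sum) after exp, so both reductions can run over the full block
without a valid_items count. The accumulator is fp32 regardless of the
templated input type.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical host-side model of one thread block: thread_data[i] is the
// value held by thread i; threads with i >= valid_items hold no real data.
template <typename T>
std::vector<float> MaskedSoftmax(const std::vector<T>& thread_data,
                                 std::size_t valid_items) {
  const std::size_t block_size = thread_data.size();
  // Assumption of this sketch: at least one valid item, otherwise the
  // max is -inf and exp(-inf - (-inf)) would be NaN.
  assert(valid_items >= 1 && valid_items <= block_size);

  const float neg_inf = -std::numeric_limits<float>::infinity();

  // Mask every thread's data up front; accumulate in fp32 regardless of T.
  std::vector<float> x(block_size);
  for (std::size_t i = 0; i < block_size; ++i) {
    x[i] = i < valid_items ? static_cast<float>(thread_data[i]) : neg_inf;
  }

  // Full-width max reduction: -inf never wins, so no item count is needed.
  float max_val = neg_inf;
  for (float v : x) max_val = std::max(max_val, v);

  // Full-width sum reduction: masked threads contribute exp(-inf) == 0.
  std::vector<float> out(block_size);
  float sum = 0.0f;
  for (std::size_t i = 0; i < block_size; ++i) {
    out[i] = std::exp(x[i] - max_val);
    sum += out[i];
  }
  for (float& v : out) v /= sum;
  return out;
}
```

In a device kernel the two loops would correspond to full-block reductions;
the point of the sketch is only that masked inputs make the result identical
to a reduction restricted to the first valid_items threads.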