Fix masked_softmax's perf when element_size is not 8 (#70271)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70271
Test Plan:
Rebase on top of D32407544 and run
buck run mode/opt -c fbcode.enable_gpu_sections=true pytext/fb/tools:benchmark_masked_softmax -- masked-softmax --batch-size=10
to see the correct perf data (PT time = ~2.5x PT native time).
Reviewed By: ngimel
Differential Revision: D33268055
fbshipit-source-id: f48b17852c19c2bc646f9ed8d9d5aac85caa8a05