Optimize transpose CUDA kernel (#4233)

* optimize transpose
* optimize the case where the tensor is 3D and the permutation swaps the last two dimensions.
  This improves BERT-L throughput by ~1.4%.
* fix unit test MegatronSelfAttentionPartitionCorrectnessTest
* polish code.
* add a test and change the tile size to 16x16 for better performance.
* fix unit test
* fix test of mask_rcnn
* address code review comments.
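
For readers unfamiliar with the technique, here is a minimal sketch of a tiled
shared-memory transpose for the 3D, last-two-dimensions case with a 16x16 tile.
It illustrates the general approach only and is not the code in this change.
The names kTile, TransposeLastTwoDims and LaunchTransposeLastTwoDims are
hypothetical, and the padded shared-memory tile is an assumption, not something
this commit states.

```cuda
#include <cuda_runtime.h>

// Assumed tile edge; 16 matches the 16x16 tile size named in this commit.
constexpr int kTile = 16;

// Hypothetical kernel: in is [batch, rows, cols], out is [batch, cols, rows].
// One thread block handles one kTile x kTile tile of one batch slice.
template <typename T>
__global__ void TransposeLastTwoDims(const T* in, T* out, int rows, int cols) {
  // The extra column of padding avoids shared-memory bank conflicts
  // when the tile is read back column-wise.
  __shared__ T tile[kTile][kTile + 1];

  const size_t batch_offset = static_cast<size_t>(blockIdx.z) * rows * cols;

  // Coalesced read: consecutive threadIdx.x values touch consecutive input columns.
  int x = blockIdx.x * kTile + threadIdx.x;  // input column
  int y = blockIdx.y * kTile + threadIdx.y;  // input row
  if (x < cols && y < rows) {
    tile[threadIdx.y][threadIdx.x] = in[batch_offset + static_cast<size_t>(y) * cols + x];
  }
  __syncthreads();

  // Coalesced write: swap the block indices so that consecutive threadIdx.x
  // values touch consecutive output columns (which are input rows).
  x = blockIdx.y * kTile + threadIdx.x;  // output column = input row
  y = blockIdx.x * kTile + threadIdx.y;  // output row = input column
  if (x < rows && y < cols) {
    out[batch_offset + static_cast<size_t>(y) * rows + x] = tile[threadIdx.x][threadIdx.y];
  }
}

// Hypothetical host-side launcher; d_in and d_out are device pointers.
// batch is mapped to gridDim.z, so it must not exceed the device limit (65535).
template <typename T>
cudaError_t LaunchTransposeLastTwoDims(const T* d_in, T* d_out, int batch, int rows,
                                       int cols, cudaStream_t stream = 0) {
  dim3 block(kTile, kTile);
  dim3 grid((cols + kTile - 1) / kTile, (rows + kTile - 1) / kTile, batch);
  TransposeLastTwoDims<T><<<grid, block, 0, stream>>>(d_in, d_out, rows, cols);
  return cudaGetLastError();
}
```

Staging each tile in shared memory keeps both the global read and the global
write coalesced, which a naive element-wise permutation cannot do for this
access pattern.
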
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>