optimize transpose CUDA kernel (#4233)

Commit

5 years ago

* Optimize transpose.
* Optimize for the case where the tensor is 3D and the permutation swaps the last two dimensions. BERT-L throughput improves by ~1.4% with the transpose optimization.
* Fix UT MegatronSelfAttentionPartitionCorrectnessTest.
* Polish code.
* Add a test and change the tile size to 16x16 for better perf.
* Fix UT.
* Fix the mask_rcnn test.
* Address code review comments.

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
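For readers unfamiliar with the technique, the 3D case described above (swapping the last two dimensions with a 16x16 tile) can be sketched as a shared-memory tiled transpose. This is a minimal illustration, not the code from this commit: the names `Transpose3DLastTwoDims`, `TransposeLastTwoDimsOnGpu`, and `kTile`, the `float` element type, and the padded shared-memory tile are all assumptions made for the example.

```cuda
#include <cuda_runtime.h>
#include <stdexcept>
#include <vector>

constexpr int kTile = 16;  // 16x16 tile, the size named in the commit message

// Hypothetical kernel: input is [batch, rows, cols], output is [batch, cols, rows].
// Expects blockDim = (kTile, kTile) and gridDim.z = batch (so batch <= 65535).
__global__ void Transpose3DLastTwoDims(const float* in, float* out, int rows, int cols) {
  // The extra padding column avoids shared-memory bank conflicts on the transposed read.
  __shared__ float tile[kTile][kTile + 1];

  const size_t batch_offset = static_cast<size_t>(blockIdx.z) * rows * cols;

  int x = blockIdx.x * kTile + threadIdx.x;  // input column
  int y = blockIdx.y * kTile + threadIdx.y;  // input row
  if (x < cols && y < rows) {
    tile[threadIdx.y][threadIdx.x] = in[batch_offset + static_cast<size_t>(y) * cols + x];
  }
  __syncthreads();

  x = blockIdx.y * kTile + threadIdx.x;  // output column (an input row index)
  y = blockIdx.x * kTile + threadIdx.y;  // output row (an input column index)
  if (x < rows && y < cols) {
    out[batch_offset + static_cast<size_t>(y) * rows + x] = tile[threadIdx.x][threadIdx.y];
  }
}

// Throws on any CUDA runtime error.
inline void Check(cudaError_t status) {
  if (status != cudaSuccess) throw std::runtime_error(cudaGetErrorString(status));
}

// Hypothetical host helper: copies the input to the GPU, runs the kernel, copies back.
// For brevity, device buffers are not released if an error is thrown midway.
std::vector<float> TransposeLastTwoDimsOnGpu(const std::vector<float>& host_in,
                                             int batch, int rows, int cols) {
  const size_t count = static_cast<size_t>(batch) * rows * cols;
  if (host_in.size() != count) throw std::invalid_argument("shape does not match data size");
  std::vector<float> host_out(count);
  if (count == 0) return host_out;

  float* d_in = nullptr;
  float* d_out = nullptr;
  Check(cudaMalloc(&d_in, count * sizeof(float)));
  Check(cudaMalloc(&d_out, count * sizeof(float)));
  Check(cudaMemcpy(d_in, host_in.data(), count * sizeof(float), cudaMemcpyHostToDevice));

  const dim3 block(kTile, kTile);
  const dim3 grid((cols + kTile - 1) / kTile, (rows + kTile - 1) / kTile, batch);
  Transpose3DLastTwoDims<<<grid, block>>>(d_in, d_out, rows, cols);
  Check(cudaGetLastError());
  Check(cudaDeviceSynchronize());

  Check(cudaMemcpy(host_out.data(), d_out, count * sizeof(float), cudaMemcpyDeviceToHost));
  Check(cudaFree(d_in));
  Check(cudaFree(d_out));
  return host_out;
}
```

Staging each tile in shared memory lets both the global-memory read and the global-memory write walk contiguous addresses along `threadIdx.x`, which is the usual motivation for tiling a transpose. A naive kernel would make one of the two accesses strided.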

References

#4233 - optimize transpose CUDA kernel

Author

weixingzhang

Parents

dba22b17

onnxruntime 33e06be4 - optimize transpose CUDA kernel (#4233)
