onnxruntime
33e06be4 - optimize transpose CUDA kernel (#4233)

Commit
5 years ago
optimize transpose CUDA kernel (#4233)

* optimize transpose
* optimize for the case when the tensor is 3D and the permutation is done in last two dimension. BERT-L throughput is improved ~1.4% from transpose optimization
* fix UT MegatronSelfAttentionPartitionCorrectnessTest
* polish code.
* add test and change tile size to 16x16 for better perf.
* fix UT
* fix test of mask_rcnn
* address code review comments.

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
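For readers unfamiliar with the 3D case mentioned in the commit message, the following is a minimal sketch of a tiled shared-memory transpose that swaps the last two dimensions of a `[batch, rows, cols]` tensor. It is not the code from this commit. The names `Transpose3DLastTwoDims`, `TransposeLastTwoDimsHost`, `Check`, and `kTile` are hypothetical, the element type is assumed to be `float`, and the one-column padding of the shared tile is a common technique assumed here rather than something the commit message states. Only the 16x16 tile size is taken from the message.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

constexpr int kTile = 16;  // 16x16 tile, the size named in the commit message

// Hypothetical helper: abort on any CUDA runtime error.
static void Check(cudaError_t err) {
  if (err != cudaSuccess) {
    std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    std::abort();
  }
}

// Illustrative kernel (an assumption, not ONNX Runtime's implementation).
// Input is [batch, rows, cols]; output is [batch, cols, rows].
// Each thread block stages one tile in shared memory so that both the
// global read and the global write walk contiguous addresses.
__global__ void Transpose3DLastTwoDims(const float* in, float* out,
                                       int rows, int cols) {
  // Extra padding column: assumed here to reduce shared-memory bank conflicts.
  __shared__ float tile[kTile][kTile + 1];

  const size_t offset = static_cast<size_t>(blockIdx.z) * rows * cols;
  const float* src = in + offset;
  float* dst = out + offset;

  int x = blockIdx.x * kTile + threadIdx.x;  // input column
  int y = blockIdx.y * kTile + threadIdx.y;  // input row
  if (x < cols && y < rows) {
    tile[threadIdx.y][threadIdx.x] = src[y * cols + x];
  }
  __syncthreads();

  x = blockIdx.y * kTile + threadIdx.x;  // output column (an input row index)
  y = blockIdx.x * kTile + threadIdx.y;  // output row (an input column index)
  if (x < rows && y < cols) {
    dst[y * rows + x] = tile[threadIdx.x][threadIdx.y];
  }
}

// Hypothetical host wrapper: copies data to the device, launches the kernel,
// and returns the transposed tensor. The batch maps to gridDim.z, so this
// sketch assumes batch <= 65535.
std::vector<float> TransposeLastTwoDimsHost(const std::vector<float>& in,
                                            int batch, int rows, int cols) {
  const size_t count = static_cast<size_t>(batch) * rows * cols;
  std::vector<float> out(count);
  if (count == 0 || in.size() != count) return out;

  const size_t bytes = count * sizeof(float);
  float* d_in = nullptr;
  float* d_out = nullptr;
  Check(cudaMalloc(&d_in, bytes));
  Check(cudaMalloc(&d_out, bytes));
  Check(cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice));

  const dim3 block(kTile, kTile, 1);
  const dim3 grid((cols + kTile - 1) / kTile, (rows + kTile - 1) / kTile, batch);
  Transpose3DLastTwoDims<<<grid, block>>>(d_in, d_out, rows, cols);
  Check(cudaGetLastError());

  // A blocking copy on the default stream also waits for the kernel to finish.
  Check(cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost));
  Check(cudaFree(d_in));
  Check(cudaFree(d_out));
  return out;
}
```

Calling `TransposeLastTwoDimsHost(data, batch, rows, cols)` returns a vector in which element `(b, c, r)` equals input element `(b, r, c)`. A production kernel would typically avoid the per-call allocations and handle more element types; the sketch only illustrates the tiling idea.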