llama.cpp
230d1169 - improve CUDA cpy memory bandwidth when copying transposed tensor (#16841)

Commit

2 days ago

improve CUDA cpy memory bandwidth when copying transposed tensor (#16841)

* WIP
* added a cpy kernel specific to transposed tensor which uses smem to avoid uncoalesced access; test cases also added showing improved memory bandwidth
* added BF16 support
* more strict check to make sure src0 is a transpose
* reformulated to handle more complicated transpose cases
* bring back 2D transpose for higher performance
* allow build on windows
* transpose copy more shapes
* minor tweak
* final clean up
* restore some test cases
* keep only the kernel for true transposed case; updated with review suggestions
* make CI happy
* remove headers not needed
* reduced bank conflicts for fp16 and bf16
* add missing const*
* now bank conflicts free
* use padding instead of swizzling

---------

Co-authored-by: bssrdf <bssrdf@gmail.com>
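The last steps of the message ("now bank conflicts free", "use padding instead of swizzling") can be illustrated with a small host-side model. This is only a sketch, not the kernel from this commit: it assumes 32 shared-memory banks of 4 bytes each, 4-byte elements, and a hypothetical 32x32 tile in which each thread of a warp reads one element of the same tile column. The names `kBanks`, `kTile`, and `max_bank_conflict` are made up for illustration.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

// Assumed hardware model: 32 shared-memory banks, each one 4-byte word wide.
constexpr std::size_t kBanks = 32;
// Hypothetical tile edge: 32 rows, one warp lane per row.
constexpr std::size_t kTile = 32;

// Worst-case number of lanes in a warp that land on the same bank when the
// warp reads one column of a tile stored with `row_stride` 4-byte elements
// per row. A result of 1 means the access is free of bank conflicts.
inline std::size_t max_bank_conflict(std::size_t row_stride,
                                     std::size_t column = 0) {
    std::array<std::size_t, kBanks> hits{};  // per-bank access counters
    for (std::size_t lane = 0; lane < kTile; ++lane) {
        // Lane `lane` reads tile[lane][column], i.e. word lane*stride + column.
        const std::size_t word = lane * row_stride + column;
        ++hits[word % kBanks];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

Under these assumptions, `max_bank_conflict(32)` reports that all 32 lanes hit one bank, while `max_bank_conflict(33)` reports one lane per bank. Padding each tile row by a single element shifts consecutive rows onto different banks, which is the usual motivation for declaring a tile as `[32][33]` rather than `[32][32]`. Two-byte types such as fp16 and bf16 pack two elements per bank word, so this model does not carry over to them unchanged.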

References

#16841 - improve CUDA cpy memory bandwidth when copying transposed tensor

Author

bssrdf

Parents

a44d7712
