llama.cpp
0ed235ea - [CUDA] Added a cudaMemcpy2DAsync fast path to ggml_cuda_cpy (#25057)

Commit

2 days ago

[CUDA] Added a cudaMemcpy2DAsync fast path to ggml_cuda_cpy (#25057)

* [CUDA] Added a cudaMemcpy2DAsync fast path to ggml_cuda_cpy

  Add a CUDA ggml_cpy fast path for same-type, same-shape strided copies that are just 2D pitched block copies. When tensors are not fully contiguous but each row is contiguous, it now uses cudaMemcpy2DAsync instead of the slow element-wise scalar copy kernel.

  This fixes the GDN recurrent snapshot update with -np 4, where rollback slots are separated by cache stride gaps.

* Add new tests that execute the new optimized strided copy path

* Return unsupported for strided copy in OpenVINO, as new tests are failing
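The eligibility condition described above (same shape, rows contiguous, rows separated by a constant stride) can be illustrated with a small host-only C++ sketch. This is not the code from the commit: `StridedView`, `Pitched2D`, `as_pitched_2d`, and `copy_2d_host` are hypothetical names, the view is limited to two dimensions, and the per-row `memcpy` loop only stands in for what a single `cudaMemcpy2DAsync(dst, dpitch, src, spitch, width, height, kind, stream)` call would do on the device.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical 2D view, loosely following ggml's ne (extent) / nb (byte stride) naming.
struct StridedView {
    int64_t ne[2]; // ne[0]: elements per row, ne[1]: number of rows
    size_t  nb[2]; // nb[0]: bytes between elements, nb[1]: bytes between rows
};

// Parameters of a pitched 2D copy, in the order cudaMemcpy2DAsync expects them.
struct Pitched2D {
    size_t width_bytes; // contiguous bytes per row
    size_t height;      // number of rows
    size_t src_pitch;   // byte stride between source rows
    size_t dst_pitch;   // byte stride between destination rows
};

// Returns true if the copy is expressible as one pitched 2D block copy.
// Assumes both tensors have the same element type of size elem_size.
bool as_pitched_2d(const StridedView & src, const StridedView & dst,
                   size_t elem_size, Pitched2D & out) {
    // Shapes must match.
    if (src.ne[0] != dst.ne[0] || src.ne[1] != dst.ne[1]) return false;
    // Each row must be contiguous on both sides.
    if (src.nb[0] != elem_size || dst.nb[0] != elem_size) return false;
    const size_t width = static_cast<size_t>(src.ne[0]) * elem_size;
    // A pitch smaller than the row width would make rows overlap.
    if (src.nb[1] < width || dst.nb[1] < width) return false;
    out = Pitched2D{width, static_cast<size_t>(src.ne[1]), src.nb[1], dst.nb[1]};
    return true;
}

// Host stand-in for the device copy: one memcpy per row.
void copy_2d_host(void * dst, const void * src, const Pitched2D & p) {
    assert(p.src_pitch >= p.width_bytes && p.dst_pitch >= p.width_bytes);
    for (size_t r = 0; r < p.height; ++r) {
        std::memcpy(static_cast<char *>(dst) + r * p.dst_pitch,
                    static_cast<const char *>(src) + r * p.src_pitch,
                    p.width_bytes);
    }
}
```

A source with 4 floats per row and a row stride of 8 floats (gaps between rows, as with slots separated by cache stride gaps) passes the check, while a view whose elements are themselves strided (`nb[0] != elem_size`) does not and would have to fall back to an element-wise copy.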

References

#25057 - [CUDA] Added a cudaMemcpy2DAsync fast path to ggml_cuda_cpy

Author

gaugarg-nv

Parents

9bebfcb4
