llama.cpp
0ed235ea - [CUDA] Added a cudaMemcpy2DAsync fast path to ggml_cuda_cpy (#25057)

Commit
2 days ago
[CUDA] Added a cudaMemcpy2DAsync fast path to ggml_cuda_cpy (#25057)

* [CUDA] Added a cudaMemcpy2DAsync fast path to ggml_cuda_cpy

  Add a CUDA ggml_cpy fast path for same-type, same-shape strided copies
  that are just 2D pitched block copies. When tensors are not fully
  contiguous but each row is contiguous, it now uses cudaMemcpy2DAsync
  instead of the slow element-wise scalar copy kernel.

  This fixes the GDN recurrent snapshot update with -np 4, where rollback
  slots are separated by cache stride gaps.

* Add new tests that execute the new optimized strided copy path

* Return unsupported for strided copy in OpenVINO, as new tests are failing
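
To illustrate the condition the message describes (same type, same shape, rows contiguous, rows separated by a uniform gap), here is a minimal host-only sketch. It is not the code from this commit. The `view` struct, `as_pitched_copy`, and `copy_rows_host` are hypothetical names introduced for illustration; only the `ne`/`nb` naming follows the ggml convention of element counts and byte strides. The exact checks used in ggml_cuda_cpy may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical, simplified stand-in for a tensor view.
struct view {
    int64_t ne[4];      // elements per dimension
    size_t  nb[4];      // stride in bytes per dimension
    size_t  elem_size;  // bytes per element (stands in for the tensor type)
};

// Parameters of a 2D pitched block copy.
struct pitched_copy {
    size_t width;   // bytes in one contiguous row
    size_t height;  // number of rows
    size_t spitch;  // bytes between row starts in the source
    size_t dpitch;  // bytes between row starts in the destination
};

// Returns true if copying src to dst can be expressed as one pitched copy.
// Conservative: size-1 dimensions with unusual strides are rejected.
static bool as_pitched_copy(const view & src, const view & dst, pitched_copy & out) {
    if (src.elem_size != dst.elem_size) return false;           // same type
    for (int d = 0; d < 4; ++d) {
        if (src.ne[d] != dst.ne[d]) return false;               // same shape
    }

    // Grow the row across leading dimensions while both sides are densely packed.
    size_t width = src.elem_size;
    int d = 0;
    while (d < 4 && src.nb[d] == width && dst.nb[d] == width) {
        width *= (size_t) src.ne[d];
        ++d;
    }
    if (d == 0) return false;  // individual elements are strided: no contiguous row

    size_t height = 1, spitch = width, dpitch = width;
    if (d < 4) {
        spitch = src.nb[d];
        dpitch = dst.nb[d];
        height = (size_t) src.ne[d];
        // The remaining dimensions must continue the same uniform pitch.
        for (int k = d + 1; k < 4; ++k) {
            if (src.nb[k] != spitch * height || dst.nb[k] != dpitch * height) return false;
            height *= (size_t) src.ne[k];
        }
    }
    if (spitch < width || dpitch < width) return false;  // rows must not overlap

    out = {width, height, spitch, dpitch};
    return true;
}

// Host emulation of the pitched copy. On the device, the same parameters
// would be passed to the CUDA runtime call:
//   cudaMemcpy2DAsync(dst, p.dpitch, src, p.spitch, p.width, p.height,
//                     cudaMemcpyDeviceToDevice, stream);
static void copy_rows_host(void * dst, const void * src, const pitched_copy & p) {
    assert(p.spitch >= p.width && p.dpitch >= p.width);
    for (size_t r = 0; r < p.height; ++r) {
        std::memcpy((char *) dst + r * p.dpitch, (const char *) src + r * p.spitch, p.width);
    }
}
```

For example, a source with three rows of four 4-byte elements and a 32-byte row stride, copied to a dense destination, collapses to width 16, height 3, spitch 32, dpitch 16. A source whose elements are themselves strided (for instance `nb[0]` of 8 with 4-byte elements) is rejected and would have to go through an element-wise kernel instead.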