llama.cpp
bd90fc74 - ggml-webgpu: improve FlashAttention performance by software pipelining (#19151)

Commit
8 days ago
ggml-webgpu: improve FlashAttention performance by software pipelining (#19151)

* webgpu : pipeline flash_attn Q/K loads in WGSL
* ggml-webgpu: unroll Q*K accumulation inner loop
* ggml-webgpu: vectorization
* ggml-webgpu: unrolling
* ggml-webgpu: remove redundant unrolling
* ggml-webgpu: restore the config
* ggml-webgpu: remove redundant comments
* ggml-webgpu: formatting
* ggml-webgpu: formatting and remove vectorization
* ggml-webgpu: remove unnecessary constants
* ggml-webgpu: change QKV buffer to read_write to pass validation
* ggml-webgpu: add explanation for the additional bracket around Q K accumulate
* Indentation and for -> if for tail
* Kick off CI on wgsl-only commits

---------

Co-authored-by: Reese Levine <reeselevine1@gmail.com>
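The core idea of the commit, pipelining the Q/K tile loads so memory latency overlaps with the Q*K accumulation, can be sketched outside WGSL. The following is a minimal illustrative sketch, not the actual shader: names such as `TILE`, `load_tile`, and `qk_accumulate` are hypothetical, and in the real WebGPU shader the "next" load would genuinely be in flight while the arithmetic on the current tile runs.

```python
TILE = 4  # hypothetical tile width; the real shader picks workgroup-sized tiles

def load_tile(buf, start):
    # Stand-in for fetching one K tile into fast (workgroup) memory.
    return buf[start:start + TILE]

def qk_accumulate(q, k):
    # Software-pipelined dot product: issue the load for tile t+1
    # before computing on tile t, then rotate the buffers.
    acc = 0.0
    n_tiles = len(k) // TILE
    cur = load_tile(k, 0)  # prologue: load the first tile up front
    for t in range(n_tiles):
        # Kick off the next load early (overlaps with compute on a GPU).
        nxt = load_tile(k, (t + 1) * TILE) if t + 1 < n_tiles else None
        # Compute on the current tile; this inner loop is what the
        # commit unrolls in the WGSL version.
        for j in range(TILE):
            acc += q[t * TILE + j] * cur[j]
        cur = nxt  # rotate double buffers for the next iteration
    return acc
```

The rotation of `cur`/`nxt` is the double-buffering that makes the pipeline work: each iteration computes on data loaded one step earlier, so the load latency is hidden behind arithmetic rather than serialized with it.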