Make CUDA compile with QK_K = 64


2 years ago

Tests don't pass, plus we get misaligned access.