GPU-accelerated token generation (new quantization format) #1412
CUDA kernel for q4_0 dequant. + mat. vec. mult.
637be12f
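Below is a minimal CPU sketch, in Python, of what a q4_0 "dequantize + matrix-vector multiply" kernel computes. It is not the PR's CUDA code. It assumes the q4_0 layout that ggml used after the format change this PR targets:

- 32 weights per block with one scale `d`.
- 16 bytes, each packing two 4-bit quants.
- The low nibble of byte j holds element j, and the high nibble holds element j + 16.
- Quants are offset by 8, so they map to [-8, 7].

The function names are hypothetical.

```python
QK4_0 = 32  # weights per q4_0 block (assumed, as in ggml at the time)

def dequantize_q4_0(d, qs):
    """Dequantize one q4_0 block (assumed layout).

    d:  per-block scale (a plain Python float; ggml's storage type is not modeled)
    qs: 16 bytes, each packing two 4-bit quants
    Low nibble of byte j -> element j, high nibble -> element j + 16.
    """
    out = [0.0] * QK4_0
    for j, byte in enumerate(qs):
        out[j] = ((byte & 0x0F) - 8) * d            # 4-bit value shifted to [-8, 7]
        out[j + QK4_0 // 2] = ((byte >> 4) - 8) * d
    return out

def dequantize_mul_mat_vec_q4_0(rows, y):
    """dst[r] = dot(dequantized row r, y).

    rows: list of rows, each a list of (d, qs) blocks covering len(y) columns.
    A GPU version would typically assign rows to thread blocks and combine
    per-thread partial sums with a reduction; here it is a plain loop.
    """
    dst = []
    for blocks in rows:
        acc = 0.0
        for ib, (d, qs) in enumerate(blocks):
            w = dequantize_q4_0(d, qs)
            acc += sum(wk * yk for wk, yk in zip(w, y[ib * QK4_0:(ib + 1) * QK4_0]))
        dst.append(acc)
    return dst
```

For example, a single block with `d = 0.5` and every byte equal to `0x9A` dequantizes to 16 values of 1.0 followed by 16 values of 0.5. Multiplying it by a vector of ones therefore gives `[24.0]`. The weights never exist in dequantized form outside the loop. That is the point of fusing the two steps: only the 4-bit data has to be read from memory.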
Added q4_1 via template
12fc292e
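Adding q4_1 "via template" suggests that one matrix-vector kernel is reused across quantization formats, with the per-format dequantization plugged in. In Python, the analogous move is to pass the block size and the dequantization routine as parameters. This sketch assumes that a q4_1 block stores a scale `d` and a minimum `m` alongside 16 bytes of unsigned 4-bit quants, and reconstructs each weight as `q * d + m`. It also assumes the same nibble order as q4_0. All names are hypothetical.

```python
def dequantize_q4_1(block):
    """q4_1 block = (d, m, qs): value = q * d + m, q an unsigned 4-bit value."""
    d, m, qs = block
    half = len(qs)  # 16 bytes -> 32 values
    out = [0.0] * (2 * half)
    for j, byte in enumerate(qs):
        out[j] = (byte & 0x0F) * d + m          # low nibble -> element j
        out[j + half] = (byte >> 4) * d + m     # high nibble -> element j + 16
    return out

def dequantize_mul_mat_vec(rows, y, qk, dequantize):
    """Generic mat-vec: block size `qk` and the `dequantize` routine are
    parameters, mirroring how one C++ template can be instantiated per format."""
    dst = []
    for blocks in rows:
        acc = 0.0
        for ib, block in enumerate(blocks):
            w = dequantize(block)
            acc += sum(wk * yk for wk, yk in zip(w, y[ib * qk:(ib + 1) * qk]))
        dst.append(acc)
    return dst
```

Supporting another format then only needs a new `dequantize_*` function. The loop and accumulation logic, which are where indexing bugs tend to hide, are written once.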
Added missing __syncthreads();
7dc2f57e
--gpu_layers -> --gpu-layers
f0af4757
slaren commented on 2023-05-12
Shorter dequantize_mul_mat_vec line
0986c2f4
q5_0 dequantize_mul_mat kernel
9da44fdc
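Compared with q4_0, the new step in a q5_0 kernel is reassembling 5-bit values from two places. This is a minimal sketch under the following assumptions:

- A block holds a scale `d`, a 32-bit `qh` whose bit j is the fifth bit of weight j, and 16 bytes `qs` with the low 4 bits of each weight.
- The nibble order is the same as in q4_0.
- The weights are offset by 16, so they map to [-16, 15].

The function name is hypothetical.

```python
def dequantize_q5_0(d, qh, qs):
    """q5_0 block (assumed layout): scale d, 32-bit qh holding each weight's
    5th bit, 16 bytes qs holding the low 4 bits (two weights per byte)."""
    out = [0.0] * 32
    for j, byte in enumerate(qs):
        hi0 = ((qh >> j) & 1) << 4           # 5th bit of element j
        hi1 = ((qh >> (j + 16)) & 1) << 4    # 5th bit of element j + 16
        out[j] = (((byte & 0x0F) | hi0) - 16) * d   # 5-bit value in [-16, 15]
        out[j + 16] = (((byte >> 4) | hi1) - 16) * d
    return out
```

The matrix-vector part is unchanged from the 4-bit formats. Only the per-weight unpacking differs.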
More readable dequantize_mul_mat_vec logic
5a0ecf76
dequantize_mul_mat_vec kernels for q5_1, q8_0, f16
bb0993ed
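The three formats named in this commit differ only in how a weight is recovered. This sketch covers the dequantization step for each one, under stated assumptions. The names are hypothetical.

- **q8_0:** assumed to be 32 signed 8-bit quants times a scale.
- **q5_1:** assumed to pack bits like q5_0, but with unsigned quants plus a per-block minimum `m`.
- **f16:** involves no quantization, only conversion from IEEE half precision. Python's `struct` module handles this with the `e` format character.

```python
import struct

def dequantize_q8_0(d, qs):
    """q8_0 (assumed): signed 8-bit quants and one scale: value = q * d."""
    return [q * d for q in qs]

def dequantize_q5_1(d, m, qh, qs):
    """q5_1 (assumed): q5_0-style bit packing, unsigned 5-bit quants,
    value = q * d + m."""
    out = [0.0] * 32
    for j, byte in enumerate(qs):
        q0 = (byte & 0x0F) | (((qh >> j) & 1) << 4)         # element j
        q1 = (byte >> 4) | (((qh >> (j + 16)) & 1) << 4)    # element j + 16
        out[j] = q0 * d + m
        out[j + 16] = q1 * d + m
    return out

def dequantize_f16(raw):
    """f16: convert little-endian IEEE half floats to Python floats."""
    n = len(raw) // 2
    return list(struct.unpack(f"<{n}e", raw))
```

For example, `dequantize_f16(b"\x00\x3c\x00\xc0")` yields `[1.0, -2.0]`, because 0x3C00 and 0xC000 are the half-precision encodings of 1.0 and -2.0.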
llama : offload "output" tensor to GPU too + coding style fixes
ad8a9e69
ggerganov approved these changes on 2023-05-13
ggerganov merged 905d87b7 into master 2 years ago
slaren commented on 2023-05-13