llama : add Mixtral support #4406
convert : support Mixtral as LLAMA arch
dff8cbeb
convert : fix n_ff typo
d38e41ee
llama : model loading
a3eefe95
ggml : sync latest ggml_mul_mat_id
861cd678
llama : update graph to support MoE
aedfad12
llama : fix cur -> cur_expert
af1a096b
llama : first working version
7ea36953
llama : fix expert weighting in the FFN
8b185b70
ggml : ggml_get_rows support 2D indexing [n_tokens, n_experts] (cpu o…
7372b622
ggml : add n_as argument to ggml_mul_mat_id
ee8fb399
ggml : fix ggml_get_rows to take into account ne02 / ne11
9064b1ca
metal : add more general support for ggml_get_rows + tests
2cbcba82
llama : add basic support for offloading moe with CUDA
06dfde3e
metal : add/mul/div use general kernel when src1 not cont
7e2006b0
metal : reduce the kernel launches for ggml_mul_mat_id
8c5b66ee
ggml : get_rows : support non-contiguous tensors with gaps, generalize…
ac3f7d8e
ggml : update get_rows f16 and q
2e4db482
cuda : support non-contiguous src1 in get_rows
62b95f93
llama : offload missing ffn_moe_silu
0710b0f7
metal : fix ggml_get_rows to work with non-cont src1
016f9bb5
metal : add indirect mat-vec kernels for all quantization types
6cfb31f9
llama : do not quantize expert gating tensors
d1259b7b
llama : add n_expert and n_expert_used to hparams + change quants
e640cbe0
test-backend-ops : add moe test
cefebb36
cuda : fix get_rows when ncols is odd
8614aa73
convert : determine n_ctx correctly
65923a8e
metal : fix ggml_mul_mat_id for F32
b0b83dd9
test-backend-ops : make experts more evenly probable (test_moe)
54ba2634
test-backend-ops : cleanup, add moe test for batches
54d254bb
test-backend-ops : add cpy from f32 -> all types test
f1380d78
test-backend-ops : fix dequantize block offset
b0029815
llama : fix hard-coded number of experts
8cbaed1d
test-backend-ops : simplify and disable slow tests to avoid CI timeout
ffda94c8
test-backend-ops : disable MOE test with thread sanitizer
33e50f1b
deniaud approved these changes on 2023-12-11
cuda : fix mul_mat_id with multi gpu
296c945d
convert : use 1e6 rope_freq_base for mixtral
7dc75e39
convert : fix style
f1cbfabd
convert : support safetensors format
6a419f4d
gguf-py : bump version
a742d9f9
metal : add cpy f16 -> f32 kernel
08eb9917
metal : fix binary ops for ne10 % 4 != 0
a51bc0c1
test-backend-ops : add one more sum_rows test
ea4402bb
ggml : do not use BLAS with ggml_mul_mat_id
90c12e6b
convert-hf : support for mixtral-instruct (#4428)
82e4f645
metal : fix soft_max kernels
ab558ac2
metal : limit kernels to not use more than the allowed threads
109e7aa8
metal : switch to execution barriers + fix one of the barriers
e1241d9b
ggerganov approved these changes on 2023-12-13
ggerganov merged 799a1cb1 into master on 2023-12-13