llama.cpp
Loading models directly into VRAM, norm calculation on GPUs, broadcasting for ggml_mul
#1483
Merged
ggerganov merged 35 commits into ggml-org:master from JohannesGaessler:gpu-norms
JohannesGaessler requested a review from slaren 2 years ago
JohannesGaessler added the enhancement label
JohannesGaessler force-pushed from af005ce3 2 years ago
JohannesGaessler force-pushed to a272e71d 2 years ago
JohannesGaessler changed the title from "Norm calculation on GPUs, broadcasting for ggml_mul" to "Loading models directly into VRAM, norm calculation on GPUs, broadcasting for ggml_mul" 2 years ago
JohannesGaessler marked this pull request as draft 2 years ago
JohannesGaessler force-pushed from 9acc42f8 2 years ago
Broadcasting for ggml_mul (de65783b)
CUDA kernel for ggml_mul, norms in VRAM (2365a2a9)
GPU weights not in RAM, direct loading with cuFile (fa1a29f3)
JohannesGaessler force-pushed to fa1a29f3 2 years ago
slaren commented on 2023-05-18 (three review comments)
fixup! GPU weights not in RAM, direct loading with cuFile (1bfe5a98)
fixup! GPU weights not in RAM, direct loading with cuFile (24d5ddf6)
JohannesGaessler marked this pull request as ready for review 2 years ago
define default model path once, sync path with readme (#1366) (09d82511)
~7% faster Q5_1 AVX2 code (#1477) (230018d1)
convert.py: Support models which are stored in a single pytorch_model… (1af2844e)
benchmark-matmul: Print the average of the test results (#1490) (d5207bf3)
Remove unused n_parts parameter (#1509) (d916c5b8)
Fixes #1511 lambda issue for w64devkit (mingw) (#1513) (a94b3345)
make kv_f16 the default for api users (#1517) (e22541a4)
minor : fix compile warnings (6b5776b0)
readme : adds WizardLM to the list of supported models (#1485) (75c017fc)
main : make reverse prompt option act as a stop token in non-interact… (c51c64a8)
examples : add persistent chat (#1495) (0226d491)
tests : add missing header (9fd81872)
ggml : use F16 instead of F32 in Q4_0, Q4_1, Q8_0 (#1508) (211aa6af)
ggml : fix scalar implementation of Q4_1 dot (9a7af6c2)
llama : fix compile warnings in llama_set_state_data() (f14673ad)
llama : fix name shadowing and C4146 (#1526) (df512bbb)
Fix for mingw (#1462) (f401d5ff)
llama : add llama_init_backend() API (close #1527) (54ec8a96)
feature : add blis and other BLAS implementation support (#1502) (667c57f1)
Revert "feature : add blis and other BLAS implementation support (#15… (977e74d7)
GPU weights not in RAM, direct loading with cuFile (ffe9652b)
llama : code style fixes + progress print fix (f67bc3c3)
ggml : ggml_mul better broadcast support (3ec7941b)
cmake : workarounds for cufile when CMake version < 3.25 (a3586c52)
Merge branch 'master' into gpu-norms (909acb3e)
github-actions commented on 2023-05-20
ggerganov requested changes on 2023-05-20
gg rebase fixup (fee87f65)
github-actions commented on 2023-05-20
Loop in llama.cpp, fixed progress callback (b81f662e)
github-actions commented on 2023-05-20
Attempt clang-tidy fix (fadcd583)
llama : fix vram size computation (a4da072d)
ggerganov approved these changes on 2023-05-20
Add forgotten fclose() (37f2c6c2)
ggerganov merged commit affc76ed into master 2 years ago
Reviewers: ggerganov, github-actions, slaren
Assignees: no one assigned
Labels: enhancement
Milestone: no milestone