huggingface/text-generation-inference

Adding GPTQ integration tests.

Ubuntu committed 3 years ago

dca0fe25

Santacoder GPTQ support (quantized model seems awful, not sure if it's

Narsil committed 3 years ago

16d0fb04

Narsil committed 3 years ago

983c813f

Triton is actually a dependency of torch on linux.

Narsil committed 3 years ago

054a3d09

Remove lots of dead code, move triton to hard requirement

Narsil committed 3 years ago

732da694

No one saw that, therefore it didn't happen.

Narsil committed 3 years ago

5de68637

Tiny fixes for falcon.

Narsil committed 3 years ago

55cf4d25

Narsil committed 3 years ago

e5e552b4

Fixing register bias + gptq_bits type.

Ubuntu committed 3 years ago

ee1f94e6

Fixing a few things.

Ubuntu committed 3 years ago

ffe8fc46

Ubuntu committed 3 years ago

dadbbc27

Re-enabling dim=dim in TensorParallelColumn because llama.

Ubuntu committed 3 years ago

3fb8979a

Ubuntu committed 3 years ago

ae308f88

Functioning quantization script.

Ubuntu committed 3 years ago

a0a194c3

Adding quantization scripts.

Ubuntu committed 3 years ago

5a727153

Narsil committed 3 years ago

da8ebf16

Fixing the dockerfile (require triton + gcc for compiling).

Ubuntu committed 3 years ago

0b585921

Removing dead code.

Ubuntu committed 3 years ago

92f85c96

[WIP] Inference support for GPTQ (llama at least)

Ubuntu committed 3 years ago

9a12941b

feat(server): pre-allocate past key values for flash causal LM (#412)

OlivierDehaene committed 3 years ago

Verified 5ce89059

fix(makefile): Fix typo and use POSIX comparison in the makefile (#443)

piratos committed 3 years ago

Verified ca650e5b

docs(launcher): fix CUDA_VISIBLE_DEVICES helper comment (#441)

antferdom committed 3 years ago

Verified d4eb60f4

feat(server): optimize dist ops (#434)

OlivierDehaene committed 3 years ago

Verified e496c9ba

feat(server): Rework model loading (#344)

Narsil committed 3 years ago

Verified abd58ff8

chore: update openapi schema

OlivierDehaene committed 3 years ago

19c41824

feat(server): batch tokenization for flash causal lm (#411)

OlivierDehaene committed 3 years ago

Verified 6abec14a

feat(server): only compute prefill logprobs when asked (#406)

OlivierDehaene committed 3 years ago

Verified 895c5f15

feat(launcher): parse oom signal (#404)

OlivierDehaene committed 3 years ago

Verified 83b84486

feat(sagemaker): add trust remote code to entrypoint (#394)

OlivierDehaene committed 3 years ago

Verified 62fc4010

OlivierDehaene committed 3 years ago

e7248fe9
