vllm
Fix for attention layers to remain unquantized during moe_wna16 quant
#12570
Merged
youkaichao merged 42 commits into vllm-project:main from srikanthsrnvs:fix-moe-wna16-attention
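The substance of the fix, per the PR title and the related commit "[Bugfix] fix moe_wna16 get_quant_method (#12648)": with moe_wna16 quantization, only the fused MoE expert weights should use the weight-only WNA16 kernels, while attention projections and other dense linear layers keep an unquantized method. Below is a minimal, self-contained sketch of that dispatch; the class names mirror vLLM's types (`FusedMoE`, `LinearBase`, `UnquantizedLinearMethod`) but are stubbed here for illustration, so this is not the PR's actual diff.

```python
# Illustrative sketch only: stand-in stubs, not vLLM's real classes.
from typing import Optional


class FusedMoE:                  # stand-in for vLLM's fused MoE layer
    pass


class LinearBase:                # stand-in for attention/dense projections
    pass


class UnquantizedLinearMethod:   # plain fp16/bf16 matmul, no quantization
    pass


class MoeWNA16Method:            # weight-only (WNA16) MoE expert kernels
    pass


class MoeWNA16Config:
    """Quant config that quantizes only the MoE expert weights."""

    def get_quant_method(self, layer, prefix: str) -> Optional[object]:
        if isinstance(layer, FusedMoE):
            # Only the fused expert weights go through the WNA16 kernels.
            return MoeWNA16Method()
        if isinstance(layer, LinearBase):
            # Attention projections (q/k/v/o) and other dense layers fall
            # through here: return an explicit unquantized method so they
            # run in full precision instead of being mis-handled.
            return UnquantizedLinearMethod()
        return None


# Quick check of the dispatch:
cfg = MoeWNA16Config()
assert isinstance(cfg.get_quant_method(FusedMoE(), "mlp.experts"),
                  MoeWNA16Method)
assert isinstance(cfg.get_quant_method(LinearBase(), "self_attn.qkv_proj"),
                  UnquantizedLinearMethod)
```

The key line is the `isinstance(layer, FusedMoE)` check: everything that is not a fused MoE layer, including the attention projections, falls through to the unquantized path.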
srikanthsrnvs requested reviews from mgoin, robertgshaw2-redhat, and tlrmchlsmth 1 year ago
mgoin approved these changes on 2025-01-31
mgoin added the quantization and ready labels
483b60c0  Fix for attention layers to remain unquantized during moe_wn16 quant …
915fdce8  Set `?device={device}` when changing tab in installation guides (#12560)
d689505a  [Misc] fix typo: add missing space in lora adapter error message (#12…
689bd199  [Kernel] Triton Configs for Fp8 Block Quantization (#11589)
f7a4e122  [CPU][PPC] Updated torch, torchvision, torchaudio dependencies (#12555)
95b49be3  [V1][Log] Add max request concurrency log to V1 (#12569)
b0d72881  [Kernel] Update `cutlass_scaled_mm` to support 2d group (blockwise) s…
9813962a  [ROCm][AMD][Model] llama 3.2 support upstreaming (#12421)
897c8c24  [Attention] MLA decode optimizations (#12528)
c4795ce0  [Bugfix] Gracefully handle huggingface hub http error (#12571)
a5e6700c  Format
1ce860be  Add favicon to docs (#12611)
bc9d8314  [BugFix] Fix Torch.Compile For DeepSeek (#12594)
22b918de  [Git] Automatically sign-off commits (#12595)
00df0e4b  [Docs][V1] Prefix caching design (#12598)
44fa70d9  [v1][Bugfix] Add extra_keys to block_hash for prefix caching (#12603)
fdd86fbb  [release] Add input step to ask for Release version (#12631)
c4a7c261  [Bugfix] Revert MoE Triton Config Default (#12629)
e7c98c61  [Kernel][Quantization] Integrate block-quantized CUTLASS kernels for …
d27e55d2  [Feature] Fix guided decoding blocking bitmask memcpy (#12563)
bece70b9  [Doc] Improve installation signposting (#12575)
6b7e4331  [Doc] int4 w4a16 example (#12585)
fd9060b1  [V1] Bugfix: Validate Model Input Length (#12600)
8ae26746  [BugFix] fix wrong output when using lora and num_scheduler_steps=8 (…
19d375d6  Fix target matching for fused layers with compressed-tensors (#12617)
64d11309  [ci] Upgrade transformers to 4.48.2 in CI dependencies (#12599)
674ab715  [Bugfix/CI] Fixup benchmark_moe.py (#12562)
9a614434  Fix: Respect `sparsity_config.ignore` in Cutlass Integration (#12517)
bb942601  [Attention] Deepseek v3 MLA support with FP8 compute (#12601)
55727d05  [CI/Build] Add label automation for structured-output, speculative-de…
4bad7108  Disable chunked prefill and/or prefix caching when MLA is enabled (#…
f292876e  Apply torch.compile to fused_moe/grouped_topk (#12637)
0079b1c7  doc: fixing minor typo in readme.md (#12643)
a4124cbb  [Bugfix] fix moe_wna16 get_quant_method (#12648)
8f1a0616  [Core] Silence unnecessary deprecation warnings (#12620)
f4e9f990  [V1][Minor] Avoid frequently creating ConstantList (#12653)
f709c159  [Core][v1] Unify allocating slots in prefill and decode in KV cache m…
87e5e8b8  [Hardware][Intel GPU] add XPU bf16 support (#12392)
b9895606  [Misc] Add SPDX-License-Identifier headers to python source files (#1…
d0cd67a7  [doc][misc] clarify VLLM_HOST_IP for multi-node inference (#12667)
srikanthsrnvs force pushed to d0cd67a7 1 year ago
srikanthsrnvs requested reviews from youkaichao, alexm-redhat, comaniac, simon-mo, WoosukKwon, njhill, LiuXiaoxuanPKU, KuntaiDu, DarkLight1337, ywang96, and zhuohan123 1 year ago
mergify added the documentation, ci/build, frontend, structured-output, speculative-decoding, and v1 labels
mergify added the needs-rebase label
8b5a0ea1  Merge branch 'main' into fix-moe-wna16-attention
mergify removed the needs-rebase label
9d09ec0d  unused imports
DarkLight1337 enabled auto-merge (squash) 1 year ago
Auto-merge disabled 1 year ago (manually disabled by user)
youkaichao merged b9986454 into main 1 year ago
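For context, here is a sketch of how this quantization path gets exercised from the user side. The checkpoint path is a placeholder, and passing `quantization="moe_wna16"` to `vllm.LLM` is assumed from the method name used throughout this PR rather than taken from its diff.

```python
from vllm import LLM

# Assumed usage: load a MoE checkpoint with the moe_wna16 backend.
# With this fix, attention projections stay in full precision; only the
# fused MoE expert weights use the weight-only (WNA16) kernels.
llm = LLM(
    model="path/to/moe-checkpoint",  # placeholder checkpoint path
    quantization="moe_wna16",        # assumed backend name
)
print(llm.generate("Hello, world")[0].outputs[0].text)
```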
Reviewers
mgoin
robertgshaw2-redhat
tlrmchlsmth
youkaichao
alexm-redhat
comaniac
simon-mo
WoosukKwon
njhill
LiuXiaoxuanPKU
KuntaiDu
DarkLight1337
ywang96
zhuohan123
Assignees
No one assigned
Labels
documentation
structured-output
frontend
speculative-decoding
ready
ci/build
v1
Milestone
No milestone