vllm
Fix for attention layers to remain unquantized during moe_wna16 quant
#12570
Merged
youkaichao merged 42 commits into vllm-project:main from srikanthsrnvs:fix-moe-wna16-attention
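The substance of the fix, per the PR title and the related commit "[Bugfix] fix moe_wna16 get_quant_method (#12648)": with moe_wna16 quantization, only the fused MoE expert weights should use the weight-only WNA16 kernels, while attention projections and other dense linear layers keep an unquantized method. Below is a minimal, self-contained sketch of that dispatch; the class names mirror vLLM's types (`FusedMoE`, `LinearBase`, `UnquantizedLinearMethod`) but are stubbed here for illustration, so this is not the PR's actual diff.

```python
# Illustrative sketch only: stand-in stubs, not vLLM's real classes.
from typing import Optional


class FusedMoE:                  # stand-in for vLLM's fused MoE layer
    pass


class LinearBase:                # stand-in for attention/dense projections
    pass


class UnquantizedLinearMethod:   # plain fp16/bf16 matmul, no quantization
    pass


class MoeWNA16Method:            # weight-only (WNA16) MoE expert kernels
    pass


class MoeWNA16Config:
    """Quant config that quantizes only the MoE expert weights."""

    def get_quant_method(self, layer, prefix: str) -> Optional[object]:
        if isinstance(layer, FusedMoE):
            # Only the fused expert weights go through the WNA16 kernels.
            return MoeWNA16Method()
        if isinstance(layer, LinearBase):
            # Attention projections (q/k/v/o) and other dense layers fall
            # through here: return an explicit unquantized method so they
            # run in full precision instead of being mis-handled.
            return UnquantizedLinearMethod()
        return None


# Quick check of the dispatch:
cfg = MoeWNA16Config()
assert isinstance(cfg.get_quant_method(FusedMoE(), "mlp.experts"),
                  MoeWNA16Method)
assert isinstance(cfg.get_quant_method(LinearBase(), "self_attn.qkv_proj"),
                  UnquantizedLinearMethod)
```

The key line is the `isinstance(layer, FusedMoE)` check: everything that is not a fused MoE layer, including the attention projections, falls through to the unquantized path.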
srikanthsrnvs requested reviews from mgoin, robertgshaw2-redhat, and tlrmchlsmth 1 year ago
mgoin approved these changes on 2025-01-31
mgoin added the quantization and ready labels
483b60c0  Fix for attention layers to remain unquantized during moe_wn16 quant …
915fdce8  Set `?device={device}` when changing tab in installation guides (#12560)
d689505a  [Misc] fix typo: add missing space in lora adapter error message (#12…
689bd199  [Kernel] Triton Configs for Fp8 Block Quantization (#11589)
f7a4e122  [CPU][PPC] Updated torch, torchvision, torchaudio dependencies (#12555)
95b49be3  [V1][Log] Add max request concurrency log to V1 (#12569)
b0d72881  [Kernel] Update `cutlass_scaled_mm` to support 2d group (blockwise) s…
9813962a  [ROCm][AMD][Model] llama 3.2 support upstreaming (#12421)
897c8c24  [Attention] MLA decode optimizations (#12528)
c4795ce0  [Bugfix] Gracefully handle huggingface hub http error (#12571)
a5e6700c  Format
1ce860be  Add favicon to docs (#12611)
bc9d8314  [BugFix] Fix Torch.Compile For DeepSeek (#12594)
22b918de  [Git] Automatically sign-off commits (#12595)
00df0e4b  [Docs][V1] Prefix caching design (#12598)
44fa70d9  [v1][Bugfix] Add extra_keys to block_hash for prefix caching (#12603)
fdd86fbb  [release] Add input step to ask for Release version (#12631)
c4a7c261  [Bugfix] Revert MoE Triton Config Default (#12629)
e7c98c61  [Kernel][Quantization] Integrate block-quantized CUTLASS kernels for …
d27e55d2  [Feature] Fix guided decoding blocking bitmask memcpy (#12563)
bece70b9  [Doc] Improve installation signposting (#12575)
6b7e4331  [Doc] int4 w4a16 example (#12585)
fd9060b1  [V1] Bugfix: Validate Model Input Length (#12600)
8ae26746  [BugFix] fix wrong output when using lora and num_scheduler_steps=8 (…
19d375d6  Fix target matching for fused layers with compressed-tensors (#12617)
64d11309  [ci] Upgrade transformers to 4.48.2 in CI dependencies (#12599)
674ab715  [Bugfix/CI] Fixup benchmark_moe.py (#12562)
9a614434  Fix: Respect `sparsity_config.ignore` in Cutlass Integration (#12517)
bb942601  [Attention] Deepseek v3 MLA support with FP8 compute (#12601)
55727d05  [CI/Build] Add label automation for structured-output, speculative-de…
4bad7108  Disable chunked prefill and/or prefix caching when MLA is enabled (#…
f292876e  Apply torch.compile to fused_moe/grouped_topk (#12637)
0079b1c7  doc: fixing minor typo in readme.md (#12643)
a4124cbb  [Bugfix] fix moe_wna16 get_quant_method (#12648)
8f1a0616  [Core] Silence unnecessary deprecation warnings (#12620)
f4e9f990  [V1][Minor] Avoid frequently creating ConstantList (#12653)
f709c159  [Core][v1] Unify allocating slots in prefill and decode in KV cache m…
87e5e8b8  [Hardware][Intel GPU] add XPU bf16 support (#12392)
b9895606  [Misc] Add SPDX-License-Identifier headers to python source files (#1…
d0cd67a7  [doc][misc] clarify VLLM_HOST_IP for multi-node inference (#12667)
srikanthsrnvs force pushed to d0cd67a7 1 year ago
srikanthsrnvs requested reviews from youkaichao, alexm-redhat, comaniac, simon-mo, WoosukKwon, njhill, LiuXiaoxuanPKU, KuntaiDu, DarkLight1337, ywang96, and zhuohan123 1 year ago
mergify added the documentation, ci/build, frontend, structured-output, speculative-decoding, and v1 labels
mergify added the needs-rebase label
8b5a0ea1  Merge branch 'main' into fix-moe-wna16-attention
mergify removed the needs-rebase label
9d09ec0d  unused imports
DarkLight1337 enabled auto-merge (squash) 1 year ago
Auto-merge disabled 1 year ago (manually disabled by user)
youkaichao merged b9986454 into main 1 year ago
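For context, here is a sketch of how this quantization path gets exercised from the user side. The checkpoint path is a placeholder, and passing `quantization="moe_wna16"` to `vllm.LLM` is assumed from the method name used throughout this PR rather than taken from its diff.

```python
from vllm import LLM

# Assumed usage: load a MoE checkpoint with the moe_wna16 backend.
# With this fix, attention projections stay in full precision; only the
# fused MoE expert weights use the weight-only (WNA16) kernels.
llm = LLM(
    model="path/to/moe-checkpoint",  # placeholder checkpoint path
    quantization="moe_wna16",        # assumed backend name
)
print(llm.generate("Hello, world")[0].outputs[0].text)
```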
Reviewers
mgoin
robertgshaw2-redhat
tlrmchlsmth
youkaichao
alexm-redhat
comaniac
simon-mo
WoosukKwon
njhill
LiuXiaoxuanPKU
KuntaiDu
DarkLight1337
ywang96
zhuohan123
Assignees
No one assigned
Labels
documentation
structured-output
frontend
speculative-decoding
ready
ci/build
v1
Milestone
No milestone