[Refactor] Refactor FP8 & INT8 Quant Folder inside `w8a8` #25293
yewentao256 changed the title from [DO NOT MERGE] Test to [Refactor] Refactor FP8 & INT8 Quant Folder inside `w8a8` 156 days ago
optimize: eliminate duplicate split_enc_dec_inputs calls (#25573)
d8ffa3c5
[Bugfix] fix apply_temperature to avoid nan in probs (#24734)
94b78f57
[Misc] Simplify PoolerOutput and move to `v1/outputs` (#25629)
8b17d255
Map CwmForCausalLM to llama and LlamaForCausalLM (#25611)
004eed39
typo: remove duplicate `is` (#25641)
6a437a41
Revert "[Performance] Move apply_w8a8_block_fp8_linear to an op class…
6c6e5536
[fix] Update torch version in cpu-build.txt for AArch64/ppc64le and D…
5e16b8c5
[Misc] Fix Qwen3-VL `video_grid_thw` typing (#25646)
fd28c588
[Bugfix] Add triton.language.tensor placeholder (#25649)
034c0152
[Bugfix] Fix Qwen3-VL max_num_video_tokens calculation for video prof…
f17d37b0
[mypy] Further improve MM type annotations (#25654)
686cfd91
[Bugfix] Parse SpeculativeConfig Error (#25142)
3d940e2c
[V0 deprecation] Remove unreachable model_config.supported_tasks (#25…
f3d9099b
Add backward compatibility for `guided_...` API (#25615)
22114ffe
[CI/Build] Fix flaky entrypoints test (#25663)
22241131
[XPU][Triton]add xpu config in triton_reshape_and_cache_flash (#25643)
d7f6489f
[Hardware][RISC-V] Add riscv64 support for vLLM with scalar (#22112)
a88371f8
[mypy] Fix wrong type annotations related to tuple (#25660)
af10a37c
[misc] log info messages by default for hanging / busy / idle (#25627)
a5fa821b
[torch.compile] Make Query Quantization Fusable (#24914)
18c20257
[CPU] update torch 2.8 and fix missing fields in TorchSDPAMetadata (#…
2469b829
[ux] Switch a warning to debug about a pytorch fallback (#23750)
054c8b52
[Bugfix] Fix InternS1 video processing after Transformers v4.56 (#25644)
f7f76a86
[Misc] Remove cruft file in repo (#25678)
91d42997
[Logging] Remove TORCH_NCCL_AVOID_RECORD_STREAMS to squash a warning …
2655d7ab
[BUGFIX] Fix crash in Eagle Speculative Decoding models when exceedin…
252a0ff8
Revert "[Bug] Dynamo Unsupported due to `BasevLLMParameter.torch_func…
0cee734a
[BugFix] Fix DBO hang (#25625)
fe6357a7
[Model] Add optional parameter to reasoning parser constructor (#25554)
c7ca3c5d
[Model] Define `merge_by_field_config` MM interface (#25676)
34e6a31e
[V0 deprecation] Clean up V0 fallback in compilation config (#25675)
9659b7e7
[V0 deprecation] Remove _VLLM_V1 suffixes from attention backend name…
a3555612
[V0 deprecation] Clean up LoRA (#25686)
80385959
[Misc] Simplify `test_argsort_mm_positions` (#25690)
b0e9f04b
[Optimization] Streamline `InputPreprocessor` (#25702)
745b204d
[Optimization] Use a cheaper cache key in `get_model_architecture` (#…
b558c3a8
[Spec Decode] Add Batch Parallel Ngram. Upto 8x lower overhead. (#24986)
f3a478b5
[Core] Enable command line logging for LLMEngine (#25610)
37d83608
[Model] rename NemotronH_Nano_VL -> NemotronH_Nano_VL_V2 (#25708)
1d1436c3
Fix routing_bias dtype (#25711)
1d210801
[Refactor] Remove DeepGEMM OP Register (#25710)
3a32aa8a
[Misc] Don't log shm dequeue delay warning on worker side (#25720)
6f97de4e
Llamas 3.1 405B fp4 changes upstreaming from 355_wip (#25135)
c064c826
[Core] Force PIECEWISE CUDAGraph mode for encoder-decoder (#25701)
ef160aa0
[Misc] Remove unnecessary memoryviews in shm_broadcast.py (#25721)
6ada2212
EVS Support (Video tokens pruning) (#22980)
9e6628cc
[CI/Build] fix doc build warning: Failed to get 'name: description' p…
e82e3b55
fix: revert cast to cpu in `MsgpackEncoder._encode_tensor` to avoid h…
74ea69f4
perf: Avoid copying inputs_embeds tensors to GPU unless prompt_embeds…
b2d5d423
[Hardware][AMD][Model] Triton MoE tuning configs for GLM-4.5 for MI300…
79586c54
fix: print output in offline_inference/base/chat.py example (#25744)
0aea9348
[Qwen3-Next][GDN] fixes cuda graph capturing bug in GDN metadata and …
067fe8b1
Remove cuda hard-code in compute_causal_conv1d_metadata (#25555)
bc37468b
[misc] refactor speculative config (#25657)
c761b84d
[Bugfix] Fix Shared Expert/Zero expert code in FusedMoE.process_chunk…
fa55373a
Support LongCat-Flash-Chat tool call (#24083)
ced693e8
[Doc] Update Batch-level DP docs (#25757)
87ee8535
[Model] Mamba2 varlen refactor (#21467)
62ae26c8
[CI] Fix test_shared_storage_connector_hashes (#25748)
515e30b0
[Bugfix] Properly abort pooling request. (#25734)
fb0eece2
[CI/Build] Split up Distributed Tests (#25572)
d3c732e9
[CI/Build] Fix some V1 tests not being run (#25569)
129a643b
[Quantization] Add field to skip unquantized modules for GPTQ config …
d70c1549
[BugFix] Fix using `dbo_decode_token_threshold` always (and ignoring …
6ca8d975
[ray][metrics] Replace ':' with '_' for OpenTelemetry compatibility i…
41174e28
[Misc] fix unique_filepath (#25732)
c7229821
Eagle3 that supports the Minicpm3 model (#24243)
e0175fbf
[Doc]: improve CPU(x86) build-wheel-from-source section (#25617)
8c1b61bd
[Bugfix] Improve GLM4 MoE Reasoning Parser's is_reasoning_end Conditi…
f16c440c
[Docs] Add Toronto Meetup (#25773)
51577819
[CI] Add E2E Blackwell Quantized MoE Test (#25723)
b6f16d37
[V1] address post issues related to #20059 (part 1) (#23046)
ceb34601
[CI] Fix FlashInfer AOT in release docker image (#25730)
dc191cc5
[spec decode] Consolidate speculative decode method name for MTP (#25…
1356ae0a
Reduce the Cuda Graph memory footprint when running with DBO (#25779)
dbdea93f
Kernel-override Determinism [1/n] (#25603)
c4b9864e
[Bugfix] Optimize CpuGpuBuffer initialization (#25447)
e7cba8f6
[Spec decode] automatically disable mm for text-only draft models (#2…
93ba7648
[Core] Don't count preempted tokens in prefix cache hit rate (#25787)
806b292c
Add option to restrict media domains (#25783)
dbb7782d
Add flashinfer-build.sh and register precompiled cu128 wheel in Docke…
55971f85
[Multimodal][Speculative Decoding]Eagle Eagle3 mm support, enablement…
38c2df83
[CI/Build] Add timing to Model Executor Test (#25799)
a8913725
[Misc] Update openai client example file for multimodal (#25795)
d7cf3783
[Bugfix] Add missing `image_size` for phi4_multimodal (#25796)
6970fa99
Validate API tokens in constant time (#25781)
3e7f33c8
Fix GPTQ model loading in Transformers backend (#25770)
c7ae7edb
[Bugfix][WideEP] Apply TP Attn + EP MoE fix to other models (#24982)
e94aabe0
[docs] transcriptions API audio upload (#25446)
7d92e508
[env] default nixl side port conflicts with kv-event zmq port (#25056)
9b4c7521
[Core] Refactor self.model() to call a helper for subclassing. (#25084)
7b28ef2b
[torch.compile]: Add VLLM_DEBUG_DUMP_PATH environment variable (#25651)
d8fc00d6
[Bug]: Set LD_LIBRARY_PATH to include the 'standard' CUDA location (#…
942fba38
[Bugfix] Fix Qwen3-VL regression from #24982 (#25814)
495f3682
[VLM] Update Qwen3-VL max_num_video_tokens calculation for configurab…
6dee906d
Update GLM-4.5 Doc transformers version (#25830)
e40c1269
Remove redundant cudagraph dispatcher warning (#25841)
cf0a7912
[Misc] fix tests failure by using current_platform (#25825)
eb447aff
[P/D] NIXL Updates (#25844)
70ba2d1e
[Bugfix] Fallback ViT attn backend to SDPA for blackwell (#25851)
4079a63a
[V0 Deprecation][Models] Remove all V0 condition for mm embeddings me…
b765adcc
[Misc] Remove more `get_input_embeddings_v0` (#25857)
ea55445b
update to latest deepgemm for dsv3.2 (#25871)
770a2cf7
[Bugfix] Fix requirements paths in install instructions (#25827)
85d43060
[Model][Bugfix] Fix issues in MiDashengLM implementation for quantize…
4e2774f5
[torch.compile] serialize cudagraph_mode as its enum name instead of …
9f78b9ca
[Nixl][P/D] Add cuda2cpu support (HD->DH transfer) (#24690)
f84b2a0d
[Bugfix][Speculative Decoding] Fix Eagle3 quantization config issue (…
c3399215
[CI/Build] Include Transformers backend test in nightly transformers …
616bce15
[Bugfix] Use correct key "ignore" for config.json non-quantized layer…
9555929e
[BugFix][torch.compile] KV scale calculation issues with FP8 quantiza…
c692506e
[Doc] Add documentation for vLLM continuous benchmarking and profilin…
ae0c3592
[Bugfix][ROCm] Fixing trying to import non-existent symbols from libn…
e7203c23
[Kernel] Chunk-aligned mamba2 (#24683)
b7973eab
[Doc] Polish example for torchrun dp (#25899)
4deb9c88
[V0 Deprecation] Remove `vllm.worker` and update corresponding imports (#…
97f1312f
Test Prompt Embeds/LoRA compatibility and Enable LoRA Support for OPT…
6941d53c
Move `VllmConfig` from `config/__init__.py` to `config/vllm.py` (#25271)
ea7cf8db
[BugFix] Fix DP/EP hang (#25906)
e165f980
[BugFix] Pass config_format via try_get_generation_config (#25912)
db4a03e2
[Bugfix]: Clean up chunked prefill logging when using whisper (#25075)
da716513
Updated TRL integration docs (#25684)
c0734fc5
[Bugfix][Model]fix ernie45 moe gate&bias dtype to float32 (#25936)
9dce93e0
[Model] Move `vision_feature_select_strategy` into `resolve_visual_en…
a1898466
[perf] Use CPU tensor to reduce GPU->CPU sync (#25884)
eea2536a
[NIXL] Add support for MLA caches with different latent dim (#25902)
bf8bb7e2
[CI] Move applicable tests to CPU (#24080)
8914d528
[Fix] Improve CPU backend compatibility for RISC-V (#25816)
02776c03
[Kernel][Moe Configs] Add more tuned triton configs for ExpertsInt8 a…
d9f8ded1
Add Hugging Face Inference Endpoints guide to Deployment docs (#25886)
b6ea29b7
[Bugfix][Model] Fix inference for Hunyuan dense models (#25354)
ea6144a0
[Bugfix] Fix accuracy issue of TRTLLM FP8 MOE and improve logging (#2…
8c52fccb
[Bugfix] Token type and position embeddings fail to be applied to `in…
e33579cd
[bugfix][deepseek] fix flashmla kernel selection (#25956)
206ab1f0
[Bug] Fix AttributeError: 'QKVParallelLinear' object has no attribute…
3c75d3b0
[Doc] Improve MM Pooling model documentation (#25966)
493acdb7
[Docs] Add moe kernel features doc (#25297)
6083b4d9
OffloadingConnector: Fix GPU block tracking bug (#25856)
bb2e04e4
[Llama4] [multimodal] Fix misplaced dtype cast of `cos_sin_cache` in …
8ecccdd1
[Bench] Add DeepSeekV32 to MoE benchmark (#25962)
ef318228
[V1] [P/D] Add Support for KV Load Failure Recovery (#19330)
8328d39d
Add explicit pooling classes for the Transformers backend (#25322)
b3e1846d
[Docs] Remove API Reference from search index (#25949)
16909544
[gpt-oss] use vLLM instead of openai types for streaming (#25186)
fd56f2e6
[Misc] Make EP kernels install script support uv (#25785)
e734a2a0
[Model] MTP fallback to eager for DeepSeek v32 (#25982)
d437ba32
Update launch_bounds_utils.h for correct compile on Multiple Cuda Arc…
04cb503f
[Log] Optimize Log for FP8MOE (#25709)
2b6b8599
Fix INT8 quantization error on Blackwell GPUs (SM100+) (#25935)
cd0bbf5d
[MM] Add text-only mode for Qwen3-VL (#26000)
4c094b33
[Bugfix] Fix `__syncwarp` on ROCM (#25996)
6444f65a
[BugFix] Fix default kv-cache-dtype default for DeepseekV3.2 (#25988)
7c795fdf
Update to Transformers `v4.56.2` (#24638)
fda81983
[Misc]allow disable pynccl (#25421)
9506409f
[Doc] updating torch.compile doc link (#25989)
b9ed8c96
[BugFix][MM] Fix Nonetype error when video is cache in qwen2.5-omni-t…
25e5b9cc
[Misc] Factor out common `_apply_feature_select_strategy` (#26003)
63c56cbb
[CI] Only capture a single CUDA graph size in CI by default (#25951)
e8773e62
[MISC] Fix misleading batch_size_capture_list when cuda_graph_sizes <…
a561b983
[Benchmark] Finish documented v0.11.0 deprecation of --endpoint-type …
aeff0604
[Bugfix] Apply same sampling parameters for both `n=1` and `n>1` (#26…
0944358a
[NVIDIA] Blackwell Family (#24673)
ed7eb771
Fix test_mamba_ssm_ssd.py due to missing _query_start_loc_to_chunk_in…
d2f54401
[CI] Tweaks to GPT-OSS Eval (Blackwell) for stability (#26030)
bba76234
[BugFix][DP/EP] Fix CUTLASS MLA hang under load (#26026)
90529cec
[ROCm][Build] Add support for AMD Ryzen AI MAX / AI 300 Series (#25908)
d4a83e01
[Bug] Fix Negative Cuda Memory Usage (#25683)
ce8ee3d9
[BugFix] ChunkedLocalAttention is currently not CG compatible (#26034)
ac1598d1
Support RL online quantization with torchao (#23014)
2ae74a80
[ROCm][Bugfix] Add missing parameter to ROCm backend (#26029)
91e10c72
[Misc] Make handling of SamplingParams clearer in n>1 case (#26032)
93d2be10
[CI/Build] Replace `vllm.entrypoints.openai.api_server` entrypoint wi…
fa179abd
[Small] Prevent bypassing media domain restriction via HTTP redirects…
c5880cfa
EAGLE 3: Fix preamble so that measured speedup over Eagle 1 becomes 3…
da3a188b
[Mamba][KVCacheManager] Simplify kv cache manage logic for mamba + MT…
d737c66b
[Perf] Fix and reapply move apply w8a8 block fp8 linear to class (#25…
abc55b1f
Fix MTP with deepep_low_latency (#25904)
72c5dd03
[Bugfix] Disable cascade attention with FlashInfer (#26130)
0c76bb2d
[Log] Optimize DeepGEMM Missing Log (#26106)
587b30c5
[Bug][Benchmark] Fix duplicate req in oversampling (#26140)
8db7b7f3
[Attention] Move Backend enum into registry (#25893)
2ea7d486
[CI/Build] Conditionally register cutlass_fp4_group_mm to fix buildin…
173c8a95
[DeepSeek] Improve performance of DS MLA cache kernel (#26132)
a06bb9bf
[Bug]: Limit num_reqs in dummy_run when max_num_seqs is small (#26144)
56d0073f
[gpt-oss] disable tool server initialization if no tool in request (#…
79b2fe7f
[Build/CI] Revert back to Ubuntu 20.04, install python 3.12 with uv (…
218349d7
[ROCm] [VL] [Bugfix] Fix vit flash attn dispatcher logic for ROCm (#2…
f35f896e
[Bugfix] Fix import `gemm_afp4wfp4` failure on AMD (#26068)
09b1a567
[Model] Use `merge_by_field_config` for MM models (G) (#26117)
bbeace23
`FusedMoE` support for the Transformers backend (#22650)
6b12b2ee
[BUG] Reorder model config creation (#26124)
d628fa1e
[Misc] Remove typing.List (#26150)
7e4b1861
[Input] Remove unused `prompt` field (#26097)
ae03f4c0
[Perf] Optimize `reshape_and_cache` CUDA Kernel (#25955)
5b80f220
add(v1): RequestStatesStats to RequestOutput (#24947)
edaae182
[Model] Use `merge_by_field_config` for MM models (InternVL family) (…
c81dc099
[test utils] correct wrong typing (#26159)
c6344152
[CI] Fix distributed hybrid tests in CI (#26155)
8d332b3c
[NIXL][Misc] Expose metrics from NIXL for logging to CLI (#25388)
2168fc8f
[openai] Fix missing tool usage check (system message) (#24768)
fa29d31f
[Multi Modal] Configurable MM Profiling (#25631)
2bcc7450
[Doc] Fixed shape description for fused_batched_moe.py (#25668)
564233d5
Quick fix for IMA with the Prefix Prefill kernel during graph capture…
f3768686
[Renderer] Move Processor out of AsyncLLM (#24138)
ff1daf6c
[Bugfix] Re-enable prefill of max model length (#24446)
7faf51f1
[backends][short_conv] CUDA graph piecewise edits (#24215)
c6f384da
[Model] Supplement to PR 24862: Pass param prefix to LLMHead (#25805)
fac9b430
[CI/Build] do not enforce precompilation on tpu ci tests (#25992)
d8b1f9cc
[Model] Fixed stream generator for gpt-oss + spec-decoding (#26027)
c40c0d9c
[Renderer] Move Processor out of LLMEngine (#26165)
611c23b6
Fix undefined symbol: cutlass_moe_mm_sm100 (#26098)
84135b14
[BugFix][QWEN-VL]fix wrong apply_rotary_emb_torch selection introduce…
e45271b0
Stop mergify from keeping stale PRs alive (#26169)
2d68bba3
Avoid division by zero in cache DS MLA kernel (#26174)
13e211bb
Fix V1 engine serialization error with Ray distributed executor (#26148)
9ea82ecd
[Quantization/NVFP4] Speed up TRTLLM NVFP4 MOE weight loading and fix…
920db411
Merge branch 'vllm-project:wye-refactor-w8a8-quant' into wye-refactor…
14bd6841
Merge branch 'main' into wye-refactor-w8a8-quant
13f8310d
mgoin commented on 2025-09-25
mgoin approved these changes on 2025-10-08
mgoin merged 241b4cfe into main 137 days ago
Assignees
No one assigned
Labels
rocm
ready
ci/build
qwen