[Refactor] Refactor FP8 & INT8 Quant Folder inside `w8a8` #25293
yewentao256 changed the title from [DO NOT MERGE] Test to [Refactor] Refactor FP8 & INT8 Quant Folder inside `w8a8` 156 days ago
optimize: eliminate duplicate split_enc_dec_inputs calls (#25573)
d8ffa3c5
[Bugfix] fix apply_temperature to avoid nan in probs (#24734)
94b78f57
[Misc] Simplify PoolerOutput and move to `v1/outputs` (#25629)
8b17d255
Map CwmForCausalLM to llama and LlamaForCausalLM (#25611)
004eed39
typo: remove duplicate `is` (#25641)
6a437a41
Revert "[Performance] Move apply_w8a8_block_fp8_linear to an op class…
6c6e5536
[fix] Update torch version in cpu-build.txt for AArch64/ppc64le and D…
5e16b8c5
[Misc] Fix Qwen3-VL `video_grid_thw` typing (#25646)
fd28c588
[Bugfix] Add triton.language.tensor placeholder (#25649)
034c0152
[Bugfix] Fix Qwen3-VL max_num_video_tokens calculation for video prof…
f17d37b0
[mypy] Further improve MM type annotations (#25654)
686cfd91
[Bugfix] Parse SpeculativeConfig Error (#25142)
3d940e2c
[V0 deprecation] Remove unreachable model_config.supported_tasks (#25…
f3d9099b
Add backward compatibility for `guided_...` API (#25615)
22114ffe
[CI/Build] Fix flaky entrypoints test (#25663)
22241131
[XPU][Triton]add xpu config in triton_reshape_and_cache_flash (#25643)
d7f6489f
[Hardware][RISC-V] Add riscv64 support for vLLM with scalar (#22112)
a88371f8
[mypy] Fix wrong type annotations related to tuple (#25660)
af10a37c
[misc] log info messages by default for hanging / busy / idle (#25627)
a5fa821b
[torch.compile] Make Query Quantization Fusable (#24914)
18c20257
[CPU] update torch 2.8 and fix missing fields in TorchSDPAMetadata (#…
2469b829
[ux] Switch a warning to debug about a pytorch fallback (#23750)
054c8b52
[Bugfix] Fix InternS1 video processing after Transformers v4.56 (#25644)
f7f76a86
[Misc] Remove cruft file in repo (#25678)
91d42997
[Logging] Remove TORCH_NCCL_AVOID_RECORD_STREAMS to squash a warning …
2655d7ab
[BUGFIX] Fix crash in Eagle Speculative Decoding models when exceedin…
252a0ff8
Revert "[Bug] Dynamo Unsupported due to `BasevLLMParameter.torch_func…
0cee734a
[BugFix] Fix DBO hang (#25625)
fe6357a7
[Model] Add optional parameter to reasoning parser constructor (#25554)
c7ca3c5d
[Model] Define `merge_by_field_config` MM interface (#25676)
34e6a31e
[V0 deprecation] Clean up V0 fallback in compilation config (#25675)
9659b7e7
[V0 deprecation] Remove _VLLM_V1 suffixes from attention backend name…
a3555612
[V0 deprecation] Clean up LoRA (#25686)
80385959
[Misc] Simplify `test_argsort_mm_positions` (#25690)
b0e9f04b
[Optimization] Streamline `InputPreprocessor` (#25702)
745b204d
[Optimization] Use a cheaper cache key in `get_model_architecture` (#…
b558c3a8
[Spec Decode] Add Batch Parallel Ngram. Upto 8x lower overhead. (#24986)
f3a478b5
[Core] Enable command line logging for LLMEngine (#25610)
37d83608
[Model] rename NemotronH_Nano_VL -> NemotronH_Nano_VL_V2 (#25708)
1d1436c3
Fix routing_bias dtype (#25711)
1d210801
[Refactor] Remove DeepGEMM OP Register (#25710)
3a32aa8a
[Misc] Don't log shm dequeue delay warning on worker side (#25720)
6f97de4e
Llamas 3.1 405B fp4 changes upstreaming from 355_wip (#25135)
c064c826
[Core] Force PIECEWISE CUDAGraph mode for encoder-decoder (#25701)
ef160aa0
[Misc] Remove unnecessary memoryviews in shm_broadcast.py (#25721)
6ada2212
EVS Support (Video tokens pruning) (#22980)
9e6628cc
[CI/Build] fix doc build warning: Failed to get 'name: description' p…
e82e3b55
fix: revert cast to cpu in `MsgpackEncoder._encode_tensor` to avoid h…
74ea69f4
perf: Avoid copying inputs_embeds tensors to GPU unless prompt_embeds…
b2d5d423
[Hardware][AMD][Model] Triton MoE tuning configs for GLM-4.5 for MI300…
79586c54
fix: print output in offline_inference/base/chat.py example (#25744)
0aea9348
[Qwen3-Next][GDN] fixes cuda graph capturing bug in GDN metadata and …
067fe8b1
Remove cuda hard-code in compute_causal_conv1d_metadata (#25555)
bc37468b
[misc] refactor speculative config (#25657)
c761b84d
[Bugfix] Fix Shared Expert/Zero expert code in FusedMoE.process_chunk…
fa55373a
Support LongCat-Flash-Chat tool call (#24083)
ced693e8
[Doc] Update Batch-level DP docs (#25757)
87ee8535
[Model] Mamba2 varlen refactor (#21467)
62ae26c8
[CI] Fix test_shared_storage_connector_hashes (#25748)
515e30b0
[Bugfix] Properly abort pooling request. (#25734)
fb0eece2
[CI/Build] Split up Distributed Tests (#25572)
d3c732e9
[CI/Build] Fix some V1 tests not being run (#25569)
129a643b
[Quantization] Add field to skip unquantized modules for GPTQ config …
d70c1549
[BugFix] Fix using `dbo_decode_token_threshold` always (and ignoring …
6ca8d975
[ray][metrics] Replace ':' with '_' for OpenTelemetry compatibility i…
41174e28
[Misc] fix unique_filepath (#25732)
c7229821
Eagle3 that supports the Minicpm3 model (#24243)
e0175fbf
[Doc]: improve CPU(x86) build-wheel-from-source section (#25617)
8c1b61bd
[Bugfix] Improve GLM4 MoE Reasoning Parser's is_reasoning_end Conditi…
f16c440c
[Docs] Add Toronto Meetup (#25773)
51577819
[CI] Add E2E Blackwell Quantized MoE Test (#25723)
b6f16d37
[V1] address post issues related to #20059 (part 1) (#23046)
ceb34601
[CI] Fix FlashInfer AOT in release docker image (#25730)
dc191cc5
[spec decode] Consolidate speculative decode method name for MTP (#25…
1356ae0a
Reduce the Cuda Graph memory footprint when running with DBO (#25779)
dbdea93f
Kernel-override Determinism [1/n] (#25603)
c4b9864e
[Bugfix] Optimize CpuGpuBuffer initialization (#25447)
e7cba8f6
[Spec decode] automatically disable mm for text-only draft models (#2…
93ba7648
[Core] Don't count preempted tokens in prefix cache hit rate (#25787)
806b292c
Add option to restrict media domains (#25783)
dbb7782d
Add flashinfer-build.sh and register precompiled cu128 wheel in Docke…
55971f85
[Multimodal][Speculative Decoding]Eagle Eagle3 mm support, enablement…
38c2df83
[CI/Build] Add timing to Model Executor Test (#25799)
a8913725
[Misc] Update openai client example file for multimodal (#25795)
d7cf3783
[Bugfix] Add missing `image_size` for phi4_multimodal (#25796)
6970fa99
Validate API tokens in constant time (#25781)
3e7f33c8
Fix GPTQ model loading in Transformers backend (#25770)
c7ae7edb
[Bugfix][WideEP] Apply TP Attn + EP MoE fix to other models (#24982)
e94aabe0
[docs] transcriptions API audio upload (#25446)
7d92e508
[env] default nixl side port conflicts with kv-event zmq port (#25056)
9b4c7521
[Core] Refactor self.model() to call a helper for subclassing. (#25084)
7b28ef2b
[torch.compile]: Add VLLM_DEBUG_DUMP_PATH environment variable (#25651)
d8fc00d6
[Bug]: Set LD_LIBRARY_PATH to include the 'standard' CUDA location (#…
942fba38
[Bugfix] Fix Qwen3-VL regression from #24982 (#25814)
495f3682
[VLM] Update Qwen3-VL max_num_video_tokens calculation for configurab…
6dee906d
Update GLM-4.5 Doc transformers version (#25830)
e40c1269
Remove redundant cudagraph dispatcher warning (#25841)
cf0a7912
[Misc] fix tests failure by using current_platform (#25825)
eb447aff
[P/D] NIXL Updates (#25844)
70ba2d1e
[Bugfix] Fallback ViT attn backend to SDPA for blackwell (#25851)
4079a63a
[V0 Deprecation][Models] Remove all V0 condition for mm embeddings me…
b765adcc
[Misc] Remove more `get_input_embeddings_v0` (#25857)
ea55445b
update to latest deepgemm for dsv3.2 (#25871)
770a2cf7
[Bugfix] Fix requirements paths in install instructions (#25827)
85d43060
[Model][Bugfix] Fix issues in MiDashengLM implementation for quantize…
4e2774f5
[torch.compile] serialize cudagraph_mode as its enum name instead of …
9f78b9ca
[Nixl][P/D] Add cuda2cpu support (HD->DH transfer) (#24690)
f84b2a0d
[Bugfix][Speculative Decoding] Fix Eagle3 quantization config issue (…
c3399215
[CI/Build] Include Transformers backend test in nightly transformers …
616bce15
[Bugfix] Use correct key "ignore" for config.json non-quantized layer…
9555929e
[BugFix][torch.compile] KV scale calculation issues with FP8 quantiza…
c692506e
[Doc] Add documentation for vLLM continuous benchmarking and profilin…
ae0c3592
[Bugfix][ROCm] Fixing trying to import non-existent symbols from libn…
e7203c23
[Kernel] Chunk-aligned mamba2 (#24683)
b7973eab
[Doc] Polish example for torchrun dp (#25899)
4deb9c88
[V0 Deprecation] Remove `vllm.worker` and update corresponding imports (#…
97f1312f
Test Prompt Embeds/LoRA compatibility and Enable LoRA Support for OPT…
6941d53c
Move `VllmConfig` from `config/__init__.py` to `config/vllm.py` (#25271)
ea7cf8db
[BugFix] Fix DP/EP hang (#25906)
e165f980
[BugFix] Pass config_format via try_get_generation_config (#25912)
db4a03e2
[Bugfix]: Clean up chunked prefill logging when using whisper (#25075)
da716513
Updated TRL integration docs (#25684)
c0734fc5
[Bugfix][Model]fix ernie45 moe gate&bias dtype to float32 (#25936)
9dce93e0
[Model] Move `vision_feature_select_strategy` into `resolve_visual_en…
a1898466
[perf] Use CPU tensor to reduce GPU->CPU sync (#25884)
eea2536a
[NIXL] Add support for MLA caches with different latent dim (#25902)
bf8bb7e2
[CI] Move applicable tests to CPU (#24080)
8914d528
[Fix] Improve CPU backend compatibility for RISC-V (#25816)
02776c03
[Kernel][Moe Configs] Add more tuned triton configs for ExpertsInt8 a…
d9f8ded1
Add Hugging Face Inference Endpoints guide to Deployment docs (#25886)
b6ea29b7
[Bugfix][Model] Fix inference for Hunyuan dense models (#25354)
ea6144a0
[Bugfix] Fix accuracy issue of TRTLLM FP8 MOE and improve logging (#2…
8c52fccb
[Bugfix] Token type and position embeddings fail to be applied to `in…
e33579cd
[bugfix][deepseek] fix flashmla kernel selection (#25956)
206ab1f0
[Bug] Fix AttributeError: 'QKVParallelLinear' object has no attribute…
3c75d3b0
[Doc] Improve MM Pooling model documentation (#25966)
493acdb7
[Docs] Add moe kernel features doc (#25297)
6083b4d9
OffloadingConnector: Fix GPU block tracking bug (#25856)
bb2e04e4
[Llama4] [multimodal] Fix misplaced dtype cast of `cos_sin_cache` in …
8ecccdd1
[Bench] Add DeepSeekV32 to MoE benchmark (#25962)
ef318228
[V1] [P/D] Add Support for KV Load Failure Recovery (#19330)
8328d39d
Add explicit pooling classes for the Transformers backend (#25322)
b3e1846d
[Docs] Remove API Reference from search index (#25949)
16909544
[gpt-oss] use vLLM instead of openai types for streaming (#25186)
fd56f2e6
[Misc] Make EP kernels install script support uv (#25785)
e734a2a0
[Model] MTP fallback to eager for DeepSeek v32 (#25982)
d437ba32
Update launch_bounds_utils.h for correct compile on Multiple Cuda Arc…
04cb503f
[Log] Optimize Log for FP8MOE (#25709)
2b6b8599
Fix INT8 quantization error on Blackwell GPUs (SM100+) (#25935)
cd0bbf5d
[MM] Add text-only mode for Qwen3-VL (#26000)
4c094b33
[Bugfix] Fix `__syncwarp` on ROCM (#25996)
6444f65a
[BugFix] Fix default kv-cache-dtype default for DeepseekV3.2 (#25988)
7c795fdf
Update to Transformers `v4.56.2` (#24638)
fda81983
[Misc]allow disable pynccl (#25421)
9506409f
[Doc] updating torch.compile doc link (#25989)
b9ed8c96
[BugFix][MM] Fix Nonetype error when video is cache in qwen2.5-omni-t…
25e5b9cc
[Misc] Factor out common `_apply_feature_select_strategy` (#26003)
63c56cbb
[CI] Only capture a single CUDA graph size in CI by default (#25951)
e8773e62
[MISC] Fix misleading batch_size_capture_list when cuda_graph_sizes <…
a561b983
[Benchmark] Finish documented v0.11.0 deprecation of --endpoint-type …
aeff0604
[Bugfix] Apply same sampling parameters for both `n=1` and `n>1` (#26…
0944358a
[NVIDIA] Blackwell Family (#24673)
ed7eb771
Fix test_mamba_ssm_ssd.py due to missing _query_start_loc_to_chunk_in…
d2f54401
[CI] Tweaks to GPT-OSS Eval (Blackwell) for stability (#26030)
bba76234
[BugFix][DP/EP] Fix CUTLASS MLA hang under load (#26026)
90529cec
[ROCm][Build] Add support for AMD Ryzen AI MAX / AI 300 Series (#25908)
d4a83e01
[Bug] Fix Negative Cuda Memory Usage (#25683)
ce8ee3d9
[BugFix] ChunkedLocalAttention is currently not CG compatible (#26034)
ac1598d1
Support RL online quantization with torchao (#23014)
2ae74a80
[ROCm][Bugfix] Add missing parameter to ROCm backend (#26029)
91e10c72
[Misc] Make handling of SamplingParams clearer in n>1 case (#26032)
93d2be10
[CI/Build] Replace `vllm.entrypoints.openai.api_server` entrypoint wi…
fa179abd
[Small] Prevent bypassing media domain restriction via HTTP redirects…
c5880cfa
EAGLE 3: Fix preamble so that measured speedup over Eagle 1 becomes 3…
da3a188b
[Mamba][KVCacheManager] Simplify kv cache manage logic for mamba + MT…
d737c66b
[Perf] Fix and reapply move apply w8a8 block fp8 linear to class (#25…
abc55b1f
Fix MTP with deepep_low_latency (#25904)
72c5dd03
[Bugfix] Disable cascade attention with FlashInfer (#26130)
0c76bb2d
[Log] Optimize DeepGEMM Missing Log (#26106)
587b30c5
[Bug][Benchmark] Fix duplicate req in oversampling (#26140)
8db7b7f3
[Attention] Move Backend enum into registry (#25893)
2ea7d486
[CI/Build] Conditionally register cutlass_fp4_group_mm to fix buildin…
173c8a95
[DeepSeek] Improve performance of DS MLA cache kernel (#26132)
a06bb9bf
[Bug]: Limit num_reqs in dummy_run when max_num_seqs is small (#26144)
56d0073f
[gpt-oss] disable tool server initialization if no tool in request (#…
79b2fe7f
[Build/CI] Revert back to Ubuntu 20.04, install python 3.12 with uv (…
218349d7
[ROCm] [VL] [Bugfix] Fix vit flash attn dispatcher logic for ROCm (#2…
f35f896e
[Bugfix] Fix import `gemm_afp4wfp4` failure on AMD (#26068)
09b1a567
[Model] Use `merge_by_field_config` for MM models (G) (#26117)
bbeace23
`FusedMoE` support for the Transformers backend (#22650)
6b12b2ee
[BUG] Reorder model config creation (#26124)
d628fa1e
[Misc] Remove typing.List (#26150)
7e4b1861
[Input] Remove unused `prompt` field (#26097)
ae03f4c0
[Perf] Optimize `reshape_and_cache` CUDA Kernel (#25955)
5b80f220
add(v1): RequestStatesStats to RequestOutput (#24947)
edaae182
[Model] Use `merge_by_field_config` for MM models (InternVL family) (…
c81dc099
[test utils] correct wrong typing (#26159)
c6344152
[CI] Fix distributed hybrid tests in CI (#26155)
8d332b3c
[NIXL][Misc] Expose metrics from NIXL for logging to CLI (#25388)
2168fc8f
[openai] Fix missing tool usage check (system message) (#24768)
fa29d31f
[Multi Modal] Configurable MM Profiling (#25631)
2bcc7450
[Doc] Fixed shape description for fused_batched_moe.py (#25668)
564233d5
Quick fix for IMA with the Prefix Prefill kernel during graph capture…
f3768686
[Renderer] Move Processor out of AsyncLLM (#24138)
ff1daf6c
[Bugfix] Re-enable prefill of max model length (#24446)
7faf51f1
[backends][short_conv] CUDA graph piecewise edits (#24215)
c6f384da
[Model] Supplement to PR 24862: Pass param prefix to LLMHead (#25805)
fac9b430
[CI/Build] do not enforce precompilation on tpu ci tests (#25992)
d8b1f9cc
[Model] Fixed stream generator for gpt-oss + spec-decoding (#26027)
c40c0d9c
[Renderer] Move Processor out of LLMEngine (#26165)
611c23b6
Fix undefined symbol: cutlass_moe_mm_sm100 (#26098)
84135b14
[BugFix][QWEN-VL]fix wrong apply_rotary_emb_torch selection introduce…
e45271b0
Stop mergify from keeping stale PRs alive (#26169)
2d68bba3
Avoid division by zero in cache DS MLA kernel (#26174)
13e211bb
Fix V1 engine serialization error with Ray distributed executor (#26148)
9ea82ecd
[Quantization/NVFP4] Speed up TRTLLM NVFP4 MOE weight loading and fix…
920db411
Merge branch 'vllm-project:wye-refactor-w8a8-quant' into wye-refactor…
14bd6841
Merge branch 'main' into wye-refactor-w8a8-quant
13f8310d
mgoin commented on 2025-09-25
mgoin approved these changes on 2025-10-08
mgoin merged 241b4cfe into main 137 days ago
Assignees
No one assigned
Labels
rocm
ready
ci/build
qwen