vllm
[Refactor] Refactor FP8 & INT8 Quant Folder inside `w8a8`
#25293
Merged
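For context, "w8a8" refers to quantization schemes in which both weights and activations are stored in 8 bits (INT8 or FP8), with matmuls accumulated at higher precision and rescaled afterward. The sketch below is an illustrative per-tensor symmetric INT8 example only, not vLLM's actual kernels or the code touched by this refactor; all names in it are hypothetical.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    # Per-tensor symmetric quantization: map the observed float range
    # onto the signed 8-bit grid [-127, 127] with a single scale.
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def w8a8_matmul(a: np.ndarray, w: np.ndarray) -> np.ndarray:
    # Quantize activations (a) and weights (w), accumulate the integer
    # matmul in int32, then dequantize with the product of the scales.
    qa, sa = quantize_int8(a)
    qw, sw = quantize_int8(w)
    acc = qa.astype(np.int32) @ qw.astype(np.int32)
    return acc.astype(np.float32) * (sa * sw)

np.random.seed(0)
a = np.random.randn(4, 8).astype(np.float32)
w = np.random.randn(8, 16).astype(np.float32)
out = w8a8_matmul(a, w)  # close to the float result a @ w
```

Real kernels differ in many ways (per-channel or per-block scales, FP8 formats, fused epilogues), but the quantize / integer-accumulate / rescale structure is the common shape of w8a8 paths.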

yewentao256 requested a review from tlrmchlsmth 156 days ago
yewentao256 requested a review from LucasWilkinson 156 days ago
yewentao256 requested a review from mgoin 156 days ago
yewentao256 requested a review from robertgshaw2-redhat 156 days ago
yewentao256 added the ready label
mergify added the ci/build label
mergify added the rocm label
gemini-code-assist commented on 2025-09-19
yewentao256 changed the title from [DO NOT MERGE] Test to [Refactor] Refactor FP8 & INT8 Quant Folder inside `w8a8` 156 days ago
mergify added the needs-rebase label
mergify removed the needs-rebase label
mergify added the needs-rebase label
yewentao256 requested a review from sighingnow 142 days ago
yewentao256 requested a review from 22quinn 142 days ago
mergify added the qwen label
nicole-lihui optimize: eliminate duplicate split_enc_dec_inputs calls (#25573)
d8ffa3c5
courage17340 [Bugfix] fix apply_temperature to avoid nan in probs (#24734)
94b78f57
DarkLight1337 [Misc] Simplify PoolerOutput and move to `v1/outputs` (#25629)
8b17d255
jacobkahn Map CwmForCausalLM to llama and LlamaForCausalLM (#25611)
004eed39
nicole-lihui typo: remove duplicate `is` (#25641)
6a437a41
tlrmchlsmth Revert "[Performance] Move apply_w8a8_block_fp8_linear to an op class…
6c6e5536
fadara01 [fix] Update torch version in cpu-build.txt for AArch64/ppc64le and D…
5e16b8c5
ywang96 [Misc] Fix Qwen3-VL `video_grid_thw` typing (#25646)
fd28c588
adobrzyn [Bugfix] Add triton.language.tensor placeholder (#25649)
034c0152
Isotr0py [Bugfix] Fix Qwen3-VL max_num_video_tokens calculation for video prof…
f17d37b0
DarkLight1337 [mypy] Further improve MM type annotations (#25654)
686cfd91
yyzxw [Bugfix] Parse SpeculativeConfig Error (#25142)
3d940e2c
noooop [V0 deprecation] Remove unreachable model_config.supported_tasks (#25…
f3d9099b
hmellor Add backward compatibility for `guided_...` API (#25615)
22114ffe
DarkLight1337 [CI/Build] Fix flaky entrypoints test (#25663)
22241131
jikunshang [XPU][Triton]add xpu config in triton_reshape_and_cache_flash (#25643)
d7f6489f
langc23 [Hardware][RISC-V] Add riscv64 support for vLLM with scalar (#22112)
a88371f8
DarkLight1337 [mypy] Fix wrong type annotations related to tuple (#25660)
af10a37c
youkaichao [misc] log info messages by default for hanging / busy / idle (#25627)
a5fa821b
jmkuebler [torch.compile] Make Query Quantization Fusable (#24914)
18c20257
bigPYJ1151 [CPU] update torch 2.8 and fix missing fields in TorchSDPAMetadata (#…
2469b829
russellb [ux] Switch a warning to debug about a pytorch fallback (#23750)
054c8b52
Isotr0py [Bugfix] Fix InternS1 video processing after Transformers v4.56 (#25644)
f7f76a86
NickLucche [Misc] Remove cruft file in repo (#25678)
91d42997
tlrmchlsmth [Logging] Remove TORCH_NCCL_AVOID_RECORD_STREAMS to squash a warning …
2655d7ab
AlonKejzman [BUGFIX] Fix crash in Eagle Speculative Decoding models when exceedin…
252a0ff8
mgoin Revert "[Bug] Dynamo Unsupported due to `BasevLLMParameter.torch_func…
0cee734a
LucasWilkinson [BugFix] Fix DBO hang (#25625)
fe6357a7
taohui [Model] Add optional parameter to reasoning parser constructor (#25554)
c7ca3c5d
DarkLight1337 [Model] Define `merge_by_field_config` MM interface (#25676)
34e6a31e
Isotr0py [V0 deprecation] Clean up V0 fallback in compilation config (#25675)
9659b7e7
MatthewBonanni [V0 deprecation] Remove _VLLM_V1 suffixes from attention backend name…
a3555612
jeejeelee [V0 deprecation] Clean up LoRA (#25686)
80385959
DarkLight1337 [Misc] Simplify `test_argsort_mm_positions` (#25690)
b0e9f04b
DarkLight1337 [Optimization] Streamline `InputPreprocessor` (#25702)
745b204d
DarkLight1337 [Optimization] Use a cheaper cache key in `get_model_architecture` (#…
b558c3a8
ekagra-ranjan [Spec Decode] Add Batch Parallel Ngram. Upto 8x lower overhead. (#24986)
f3a478b5
zhuohan123 [Core] Enable command line logging for LLMEngine (#25610)
37d83608
tomeras91 [Model] rename NemotronH_Nano_VL -> NemotronH_Nano_VL_V2 (#25708)
1d1436c3
wenscarl Fix routing_bias dtype (#25711)
1d210801
yewentao256 [Refactor] Remove DeepGEMM OP Register (#25710)
3a32aa8a
njhill [Misc] Don't log shm dequeue delay warning on worker side (#25720)
6f97de4e
maleksan85 Llamas 3.1 405B fp4 changes upstreaming from 355_wip (#25135)
c064c826
russellb [Core] Force PIECEWISE CUDAGraph mode for encoder-decoder (#25701)
ef160aa0
njhill [Misc] Remove unnecessary memoryviews in shm_broadcast.py (#25721)
6ada2212
BloodAxe EVS Support (Video tokens pruning) (#22980)
9e6628cc
yitingdc [CI/Build] fix doc build warning: Failed to get 'name: description' p…
e82e3b55
qthequartermasterman fix: revert cast to cpu in `MsgpackEncoder._encode_tensor` to avoid h…
74ea69f4
qthequartermasterman perf: Avoid copying inputs_embeds tensors to GPU unless prompt_embeds…
b2d5d423
xaguilar-amd [Harware][AMD][Model] Triton MoE tuning configs for GLM-4.5 for MI300…
79586c54
Iceber fix: print outputt offline_inference/base/chat.py example (#25744)
0aea9348
sighingnow [Qwen3-Next][GDN] fixes cuda graph capturing bug in GDN metadata and …
067fe8b1
wxsIcey Remove cuda hard-code in compute_causal_conv1d_metadata (#25555)
bc37468b
yyzxw [misc] refactor speculative config (#25657)
c761b84d
SageMoore [Bugfix] Fix Shared Expert/Zero expert code in FusedMoE.process_chunk…
fa55373a
Xu-Wenqing Support LongCat-Flash-Chat tool call (#24083)
ced693e8
DarkLight1337 [Doc] Update Batch-level DP docs (#25757)
87ee8535
cyang49 [Model] Mamba2 varlen refactor (#21467)
62ae26c8
chaunceyjiang [CI] Fix test_shared_storage_connector_hashes (#25748)
515e30b0
noooop [Bugfix] Properly abort pooling request. (#25734)
fb0eece2
DarkLight1337 [CI/Build] Split up Distributed Tests (#25572)
d3c732e9
DarkLight1337 [CI/Build] Fix some V1 tests not being run (#25569)
129a643b
Isotr0py [Quantization] Add field to skip unquantized modules for GPTQ config …
d70c1549
LucasWilkinson [BugFix] Fix using `dbo_decode_token_threshold` always (and ignoring …
6ca8d975
eicherseiji [ray][metrics] Replace ':' with '_' for OpenTelemetry compatibility i…
41174e28
ZJY0516 [Misc] fix unique_filepath (#25732)
c7229821
LDLINGLINGLING Eagle3 that supports the Minicpm3 model (#24243)
e0175fbf
brokedba [Doc]: improve CPU(x86) build-wheel-from-source section (#25617)
8c1b61bd
frankwang28 [Bugfix] Improve GLM4 MoE Reasoning Parser's is_reasoning_end Conditi…
f16c440c
mgoin [Docs] Add Toronto Meetup (#25773)
51577819
mgoin [CI] Add E2E Blackwell Quantized MoE Test (#25723)
b6f16d37
fhl2000 [V1] address post issues related to #20059 (part 1) (#23046)
ceb34601
mgoin [CI] Fix FlashInfer AOT in release docker image (#25730)
dc191cc5
zixi-qi [spec decode] Consolidate speculative decode method name for MTP (#25…
1356ae0a
SageMoore Reduce the Cuda Graph memory footprint when running with DBO (#25779)
dbdea93f
bwasti Kernel-override Determinism [1/n] (#25603)
c4b9864e
namanlalitnyu [Bugfix] Optimize CpuGpuBuffer initialization (#25447)
e7cba8f6
jmkuebler [Spec decode] automatically disable mm for text-only draft models (#2…
93ba7648
zhuohan123 [Core] Don't count preempted tokens in prefix cache hit rate (#25787)
806b292c
russellb Add option to restrict media domains (#25783)
dbb7782d
mgoin Add flashinfer-build.sh and register precompiled cu128 wheel in Docke…
55971f85
david6666666 [Multimodal][Speculative Decoding]Eagle Eagle3 mm support, enablement…
38c2df83
22quinn [CI/Build] Add timing to Model Executor Test (#25799)
a8913725
ywang96 [Misc] Update openai client example file for multimodal (#25795)
d7cf3783
Renovamen [Bugfix] Add missing `image_size` for phi4_multimodal (#25796)
6970fa99
russellb Validate API tokens in constant time (#25781)
3e7f33c8
hmellor Fix GPTQ model loading in Transformers backend (#25770)
c7ae7edb
tlrmchlsmth [Bugfix][WideEP] Apply TP Attn + EP MoE fix to other models (#24982)
e94aabe0
yyzxw [docs] transcriptions API audio upload (#25446)
7d92e508
panpan0000 [env] default nixl side port conflicts with kv-event zmq port (#25056)
9b4c7521
patrick-toulme [Core] Refactor self.model() to call a helper for subclassing. (#25084)
7b28ef2b
ZJY0516 [torch.compile]: Add VLLM_DEBUG_DUMP_PATH environment variable (#25651)
d8fc00d6
smarterclayton [Bug]: Set LD_LIBRARY_PATH to include the 'standard' CUDA location (#…
942fba38
ywang96 [Bugfix] Fix Qwen3-VL regression from #24982 (#25814)
495f3682
Isotr0py [VLM] Update Qwen3-VL max_num_video_tokens calculation for configurab…
6dee906d
zRzRzRzRzRzRzR Update GLM-4.5 Doc transformers version (#25830)
e40c1269
mgoin Remove redundant cudagraph dispatcher warning (#25841)
cf0a7912
kingsmad [Misc] fix tests failure by using current_platform (#25825)
eb447aff
robertgshaw2-redhat [P/D] NIXL Updates (#25844)
70ba2d1e
ywang96 [Bugfix] Fallback ViT attn backend to SDPA for blackwell (#25851)
4079a63a
Isotr0py [V0 Deprecation][Models] Remove all V0 condition for mm embeddings me…
b765adcc
DarkLight1337 [Misc] Remove more `get_input_embeddings_v0` (#25857)
ea55445b
youkaichao update to latest deepgemm for dsv3.2 (#25871)
770a2cf7
yingjun-mou [Bugfix] Fix requirements paths in install instructions (#25827)
85d43060
zhoukezi [Model][Bugfix] Fix issues in MiDashengLM implementation for quantize…
4e2774f5
ZJY0516 [torch.compile] serialize cudagraph_mode as its enum name instead of …
9f78b9ca
chenxi-yang [Nixl][P/D] Add cuda2cpu support (HD->DH transfer) (#24690)
f84b2a0d
rahul-tuli [Bugfix][Speculative Decoding] Fix Eagle3 quantization config issue (…
c3399215
Isotr0py [CI/Build] Include Transformers backend test in nightly transformers …
616bce15
leejnau [Bugfix] Use correct key "ignore" for config.json non-quantized layer…
9555929e
adabeyta [BugFix][torch.compile] KV scale calculation issues with FP8 quantiza…
c692506e
namanlalitnyu [Doc] Add documentation for vLLM continuous benchmarking and profilin…
ae0c3592
gshtras [Bugfix][ROCm] Fixing trying to import non-existent symbols from libn…
e7203c23
tdoublep [Kernel] Chunk-aligned mamba2 (#24683)
b7973eab
zhuohan123 [Doc] Polish example for torchrun dp (#25899)
4deb9c88
aarnphm [V0 Deprecation] Remove `vllm.worker` and update according imports (#…
97f1312f
qthequartermasterman Test Prompt Embeds/LoRA compatibility and Enable LoRA Support for OPT…
6941d53c
hmellor Move`VllmConfig` from `config/__init__.py` to `config/vllm.py` (#25271)
ea7cf8db
LucasWilkinson [BugFix] Fix DP/EP hang (#25906)
e165f980
acisseJZhong [BugFix] Pass config_format via try_get_generation_config (#25912)
db4a03e2
simondanielsson [Bugfix]: Clean up chunked prefill logging when using whisper (#25075)
da716513
sergiopaniego Updated TRL integration docs (#25684)
c0734fc5
CSWYF3634076 [Bugfix][Model]fix ernie45 moe gate&bias dtype to float32 (#25936)
9dce93e0
DarkLight1337 [Model] Move `vision_feature_select_strategy` into `resolve_visual_en…
a1898466
lhtin [perf] Use CPU tensor to reduce GPU->CPU sync (#25884)
eea2536a
NickLucche [NIXL] Add support for MLA caches with different latent dim (#25902)
bf8bb7e2
rzabarazesh [CI] Move applicable tests to CPU (#24080)
8914d528
ihb2032 [Fix] Improve CPU backend compatibility for RISC-V (#25816)
02776c03
Josephasafg [Kernel][Moe Configs] Add more tuned triton configs for ExpertsInt8 a…
d9f8ded1
sergiopaniego Add Hugging Face Inference Endpoints guide to Deployment docs (#25886)
b6ea29b7
Anionex [Bugfix][Model] Fix inference for Hunyuan dense models (#25354)
ea6144a0
pavanimajety [Bugfix] Fix accuracy issue of TRTLLM FP8 MOE and improve logging (#2…
8c52fccb
DarkLight1337 [Bugfix] Token type and position embeddings fail to be applied to `in…
e33579cd
youkaichao [bugfix][deepseek] fix flashmla kernel selection (#25956)
206ab1f0
yewentao256 [Bug] Fix AttributeError: 'QKVParallelLinear' object has no attribute…
3c75d3b0
DarkLight1337 [Doc] Improve MM Pooling model documentation (#25966)
493acdb7
bnellnm [Docs] Add moe kernel features doc (#25297)
6083b4d9
orozery OffloadingConnector: Fix GPU block tracking bug (#25856)
bb2e04e4
cjackal [Llama4] [multimodal] Fix misplaced dtype cast of `cos_sin_cache` in …
8ecccdd1
jeejeelee [Bench] Add DeepSeekV32 to MoE benchmark (#25962)
ef318228
sdavidbd [V1] [P/D] Add Support for KV Load Failure Recovery (#19330)
8328d39d
hmellor Add explicit pooling classes for the Transformers backend (#25322)
b3e1846d
hmellor [Docs] Remove API Reference from search index (#25949)
16909544
qandrew [gpt-oss] use vLLM instead of openai types for streaming (#25186)
fd56f2e6
LucasWilkinson [Misc] Make EP kernels install script support uv (#25785)
e734a2a0
luccafong [Model] MTP fallback to eager for DeepSeek v32 (#25982)
d437ba32
DrStone1971 Update launch_bounds_utils.h for correct compile on Multiple Cuda Arc…
04cb503f
yewentao256 [Log] Optimize Log for FP8MOE (#25709)
2b6b8599
certainly-param Fix INT8 quantization error on Blackwell GPUs (SM100+) (#25935)
cd0bbf5d
ywang96 [MM] Add text-only mode for Qwen3-VL (#26000)
4c094b33
zhewenl [Bugfix] Fix `__syncwarp` on ROCM (#25996)
6444f65a
LucasWilkinson [BugFix] Fix default kv-cache-dtype default for DeepseekV3.2 (#25988)
7c795fdf
hmellor Update to Transformers `v4.56.2` (#24638)
fda81983
luccafong [Misc]allow disable pynccl (#25421)
9506409f
vnadathur [Doc] updating torch.compile doc link (#25989)
b9ed8c96
wwl2755 [BugFix][MM] Fix Nonetype error when video is cache in qwen2.5-omni-t…
25e5b9cc
DarkLight1337 [Misc] Factor out common `_apply_feature_select_strategy` (#26003)
63c56cbb
hmellor [CI] Only capture a single CUDA graph size in CI by default (#25951)
e8773e62
billishyahao [MISC] Fix misleading batch_size_capture_list when cuda_graph_sizes <…
a561b983
natoscott [Benchmark] Finish documented v0.11.0 deprecation of --endpoint-type …
aeff0604
kmaehashi [Bugfix] Apply same sampling parameters for both `n=1` and `n>1` (#26…
0944358a
johnnynunez [NVIDIA] Blackwell Family (#24673)
ed7eb771
hl475 Fix test_mamba_ssm_ssd.py due to missing _query_start_loc_to_chunk_in…
d2f54401
mgoin [CI] Tweaks to GPT-OSS Eval (Blackwell) for stability (#26030)
bba76234
LucasWilkinson [BugFix][DP/EP] Fix CUTLASS MLA hang under load (#26026)
90529cec
hyoon1 [ROCm][Build] Add support for AMD Ryzen AI MAX / AI 300 Series (#25908)
d4a83e01
yewentao256 [Bug] Fix Negative Cuda Memory Usage (#25683)
ce8ee3d9
LucasWilkinson [BugFix] ChunkedLocalAttention is currently not CG compatible (#26034)
ac1598d1
jerryzh168 Support RL online quantization with torchao (#23014)
2ae74a80
gshtras [ROCm][Bugfix] Add missing parameter to ROCm backend (#26029)
91e10c72
njhill [Misc] Make handling of SamplingParams clearer in n>1 case (#26032)
93d2be10
DarkLight1337 [CI/Build] Replace `vllm.entrypoints.openai.api_server` entrypoint wi…
fa179abd
huachenheli [Small] Prevent bypassing media domain restriction via HTTP redirects…
c5880cfa
ekagra-ranjan EAGLE 3: Fix preamble so that measured speedup over Eagle 1 becomes 3…
da3a188b
heheda12345 [Mamba][KVCacheManager] Simplify kv cache manage logic for mamba + MT…
d737c66b
ElizaWszola [Perf] Fix and reapply move apply w8a8 block fp8 linear to class (#25…
abc55b1f
MatthewBonanni Fix MTP with deepep_low_latency (#25904)
72c5dd03
mgoin [Bugfix] Disable cascade attention with FlashInfer (#26130)
0c76bb2d
yewentao256 [Log] Optimize DeepGEMM Missing Log (#26106)
587b30c5
ekagra-ranjan [Bug][Benchmark] Fix duplicate req in oversampling (#26140)
8db7b7f3
MatthewBonanni [Attention] Move Backend enum into registry (#25893)
2ea7d486
mgoin [CI/Build] Conditionally register cutlass_fp4_group_mm to fix buildin…
173c8a95
MatthewBonanni [DeepSeek] Improve performance of DS MLA cache kernel (#26132)
a06bb9bf
benchislett [Bug]: Limit num_reqs in dummy_run when max_num_seqs is small (#26144)
56d0073f
qandrew [gpt-oss] disable tool server initialization if no tool in request (#…
79b2fe7f
tlrmchlsmth [Build/CI] Revert back to Ubuntu 20.04, install python 3.12 with uv (…
218349d7
tjtanaa [ROCm] [VL] [Bugfix] Fix vit flash attn dispatcher logic for ROCm (#2…
f35f896e
zhewenl [Bugfix] Fix import `gemm_afp4wfp4` failure on AMD (#26068)
09b1a567
DarkLight1337 [Model] Use `merge_by_field_config` for MM models (G) (#26117)
bbeace23
hmellor `FusedMoE` support for the Transformers backend (#22650)
6b12b2ee
hao-aaron [BUG] Reorder model config creation (#26124)
d628fa1e
varun-sundar-rabindranath [Misc] Remove typing.List (#26150)
7e4b1861
DarkLight1337 [Input] Remove unused `prompt` field (#26097)
ae03f4c0
ZJY0516 [Perf] Optimize `reshape_and_cache` CUDA Kernel (#25955)
5b80f220
huijjj add(v1): RequestStatesStats to RequestOutput (#24947)
edaae182
DarkLight1337 [Model] Use `merge_by_field_config` for MM models (InternVL family) (…
c81dc099
yannicks1 [test utils] correct wrong typing (#26159)
c6344152
tdoublep [CI] Fix distributed hybrid tests in CI (#26155)
8d332b3c
NickLucche [NIXL][Misc] Expose metrics from NIXL for logging to CLI (#25388)
2168fc8f
levunet [openai] Fix missing tool usage check (system message) (#24768)
fa29d31f
wwl2755 [Multi Modal] Configurable MM Profiling (#25631)
2bcc7450
Egor-Krivov [Doc] Fixed shape description for fused_batched_moe.py (#25668)
564233d5
SageMoore Quick fix for IMA with the Prefix Prefill kernel during graph capture…
f3768686
KKSK-DON [Renderer] Move Processor out of AsyncLLM (#24138)
ff1daf6c
yannicks1 [Bugfix] Re-enable prefill of max model length (#24446)
7faf51f1
paulpak58 [backends][short_conv] CUDA graph piecewise edits (#24215)
c6f384da
whx-sjtu [Model] Supplement to PR 24862: Pass param prefix to LLMHead (#25805)
fac9b430
sixiang-google [CI/Build] do not enforce precompilation on tpu ci tests (#25992)
d8b1f9cc
astralord [Model] Fixed stream generator for gpt-oss + spec-decoding (#26027)
c40c0d9c
DarkLight1337 [Renderer] Move Processor out of LLMEngine (#26165)
611c23b6
jasl Fix undefined symbol: cutlass_moe_mm_sm100 (#26098)
84135b14
xuechendi [BugFix][QWEN-VL]fix wrong apply_rotary_emb_torch selection introduce…
e45271b0
hmellor Stop mergify from keeping stale PRs alive (#26169)
2d68bba3
MatthewBonanni Avoid division by zero in cache DS MLA kernel (#26174)
13e211bb
nrghosh Fix V1 engine serialization error with Ray distributed executor (#26148)
9ea82ecd
pavanimajety [Quantization/NVFP4] Speed up TRTLLM NVFP4 MOE weight loading and fix…
920db411
mergify removed the needs-rebase label
yewentao256 Merge branch 'vllm-project:wye-refactor-w8a8-quant' into wye-refactor…
14bd6841
yewentao256 Merge branch 'main' into wye-refactor-w8a8-quant
13f8310d
mgoin commented on 2025-09-25
mgoin approved these changes on 2025-10-08
mgoin merged 241b4cfe into main 137 days ago
