PR #12570 Fix for attention layers to remain unquantized during moe_wna16 quant

Fix for attention layers to remain unquantized during moe_wna16 quant method

srikanthsrnvs committed 1 year ago

Set `?device={device}` when changing tab in installation guides (#12560)

srikanthsrnvs committed 1 year ago

[Misc] fix typo: add missing space in lora adapter error message (#12564)

srikanthsrnvs committed 1 year ago

[Kernel] Triton Configs for Fp8 Block Quantization (#11589)

srikanthsrnvs committed 1 year ago

[CPU][PPC] Updated torch, torchvision, torchaudio dependencies (#12555)

srikanthsrnvs committed 1 year ago

[V1][Log] Add max request concurrency log to V1 (#12569)

srikanthsrnvs committed 1 year ago

[Kernel] Update `cutlass_scaled_mm` to support 2d group (blockwise) scaling (#11868)

srikanthsrnvs committed 1 year ago

[ROCm][AMD][Model] llama 3.2 support upstreaming (#12421)

srikanthsrnvs committed 1 year ago

[Attention] MLA decode optimizations (#12528)

srikanthsrnvs committed 1 year ago

[Bugfix] Gracefully handle huggingface hub http error (#12571)

srikanthsrnvs committed 1 year ago

Format

srikanthsrnvs committed 1 year ago

Add favicon to docs (#12611)

srikanthsrnvs committed 1 year ago

[BugFix] Fix Torch.Compile For DeepSeek (#12594)

srikanthsrnvs committed 1 year ago

[Git] Automatically sign-off commits (#12595)

srikanthsrnvs committed 1 year ago

[Docs][V1] Prefix caching design (#12598)

srikanthsrnvs committed 1 year ago

[v1][Bugfix] Add extra_keys to block_hash for prefix caching (#12603)

srikanthsrnvs committed 1 year ago

[release] Add input step to ask for Release version (#12631)

srikanthsrnvs committed 1 year ago

[Bugfix] Revert MoE Triton Config Default (#12629)

srikanthsrnvs committed 1 year ago

[Kernel][Quantization] Integrate block-quantized CUTLASS kernels for DeepSeekV3 (#12587)

srikanthsrnvs committed 1 year ago

[Feature] Fix guided decoding blocking bitmask memcpy (#12563)

srikanthsrnvs committed 1 year ago

[Doc] Improve installation signposting (#12575)

srikanthsrnvs committed 1 year ago

[Doc] int4 w4a16 example (#12585)

srikanthsrnvs committed 1 year ago

[V1] Bugfix: Validate Model Input Length (#12600)

srikanthsrnvs committed 1 year ago

[BugFix] fix wrong output when using lora and num_scheduler_steps=8 (#11161)

srikanthsrnvs committed 1 year ago

Fix target matching for fused layers with compressed-tensors (#12617)

srikanthsrnvs committed 1 year ago

[ci] Upgrade transformers to 4.48.2 in CI dependencies (#12599)

srikanthsrnvs committed 1 year ago

[Bugfix/CI] Fixup benchmark_moe.py (#12562)

srikanthsrnvs committed 1 year ago

Fix: Respect `sparsity_config.ignore` in Cutlass Integration (#12517)

srikanthsrnvs committed 1 year ago

[Attention] Deepseek v3 MLA support with FP8 compute (#12601)

srikanthsrnvs committed 1 year ago

[CI/Build] Add label automation for structured-output, speculative-decoding, v1 (#12280)

srikanthsrnvs committed 1 year ago

Disable chunked prefill and/or prefix caching when MLA is enabled (#12642)

srikanthsrnvs committed 1 year ago

Apply torch.compile to fused_moe/grouped_topk (#12637)

srikanthsrnvs committed 1 year ago

doc: fixing minor typo in readme.md (#12643)

srikanthsrnvs committed 1 year ago

[Bugfix] fix moe_wna16 get_quant_method (#12648)

srikanthsrnvs committed 1 year ago

[Core] Silence unnecessary deprecation warnings (#12620)

srikanthsrnvs committed 1 year ago

[V1][Minor] Avoid frequently creating ConstantList (#12653)

srikanthsrnvs committed 1 year ago

[Core][v1] Unify allocating slots in prefill and decode in KV cache manager (#12608)

srikanthsrnvs committed 1 year ago

[Hardware][Intel GPU] add XPU bf16 support (#12392)

srikanthsrnvs committed 1 year ago

[Misc] Add SPDX-License-Identifier headers to python source files (#12628)

srikanthsrnvs committed 1 year ago

[doc][misc] clarify VLLM_HOST_IP for multi-node inference (#12667)

srikanthsrnvs committed 1 year ago

Merge branch 'main' into fix-moe-wna16-attention

srikanthsrnvs committed 1 year ago

unused imports

srikanthsrnvs committed 1 year ago

vllm Fix for attention layers to remain unquantized during moe_wna16 quant #12570 Merged