vllm
Fix for attention layers to remain unquantized during moe_wna16 quant
#12570
Merged

Commits
  • Fix for attention layers to remain unquantized during moe_wna16 quant method
    srikanthsrnvs committed 1 year ago
  • Set `?device={device}` when changing tab in installation guides (#12560)
    srikanthsrnvs committed 1 year ago
  • [Misc] fix typo: add missing space in lora adapter error message (#12564)
    srikanthsrnvs committed 1 year ago
  • [Kernel] Triton Configs for Fp8 Block Quantization (#11589)
    srikanthsrnvs committed 1 year ago
  • [CPU][PPC] Updated torch, torchvision, torchaudio dependencies (#12555)
    srikanthsrnvs committed 1 year ago
  • [V1][Log] Add max request concurrency log to V1 (#12569)
    srikanthsrnvs committed 1 year ago
  • [Kernel] Update `cutlass_scaled_mm` to support 2d group (blockwise) scaling (#11868)
    srikanthsrnvs committed 1 year ago
  • [ROCm][AMD][Model] llama 3.2 support upstreaming (#12421)
    srikanthsrnvs committed 1 year ago
  • [Attention] MLA decode optimizations (#12528)
    srikanthsrnvs committed 1 year ago
  • [Bugfix] Gracefully handle huggingface hub http error (#12571)
    srikanthsrnvs committed 1 year ago
  • Format
    srikanthsrnvs committed 1 year ago
  • Add favicon to docs (#12611)
    srikanthsrnvs committed 1 year ago
  • [BugFix] Fix Torch.Compile For DeepSeek (#12594)
    srikanthsrnvs committed 1 year ago
  • [Git] Automatically sign-off commits (#12595)
    srikanthsrnvs committed 1 year ago
  • [Docs][V1] Prefix caching design (#12598)
    srikanthsrnvs committed 1 year ago
  • [v1][Bugfix] Add extra_keys to block_hash for prefix caching (#12603)
    srikanthsrnvs committed 1 year ago
  • [release] Add input step to ask for Release version (#12631)
    srikanthsrnvs committed 1 year ago
  • [Bugfix] Revert MoE Triton Config Default (#12629)
    srikanthsrnvs committed 1 year ago
  • [Kernel][Quantization] Integrate block-quantized CUTLASS kernels for DeepSeekV3 (#12587)
    srikanthsrnvs committed 1 year ago
  • [Feature] Fix guided decoding blocking bitmask memcpy (#12563)
    srikanthsrnvs committed 1 year ago
  • [Doc] Improve installation signposting (#12575)
    srikanthsrnvs committed 1 year ago
  • [Doc] int4 w4a16 example (#12585)
    srikanthsrnvs committed 1 year ago
  • [V1] Bugfix: Validate Model Input Length (#12600)
    srikanthsrnvs committed 1 year ago
  • [BugFix] fix wrong output when using lora and num_scheduler_steps=8 (#11161)
    srikanthsrnvs committed 1 year ago
  • Fix target matching for fused layers with compressed-tensors (#12617)
    srikanthsrnvs committed 1 year ago
  • [ci] Upgrade transformers to 4.48.2 in CI dependencies (#12599)
    srikanthsrnvs committed 1 year ago
  • [Bugfix/CI] Fixup benchmark_moe.py (#12562)
    srikanthsrnvs committed 1 year ago
  • Fix: Respect `sparsity_config.ignore` in Cutlass Integration (#12517)
    srikanthsrnvs committed 1 year ago
  • [Attention] Deepseek v3 MLA support with FP8 compute (#12601)
    srikanthsrnvs committed 1 year ago
  • [CI/Build] Add label automation for structured-output, speculative-decoding, v1 (#12280)
    srikanthsrnvs committed 1 year ago
  • Disable chunked prefill and/or prefix caching when MLA is enabled (#12642)
    srikanthsrnvs committed 1 year ago
  • Apply torch.compile to fused_moe/grouped_topk (#12637)
    srikanthsrnvs committed 1 year ago
  • doc: fixing minor typo in readme.md (#12643)
    srikanthsrnvs committed 1 year ago
  • [Bugfix] fix moe_wna16 get_quant_method (#12648)
    srikanthsrnvs committed 1 year ago
  • [Core] Silence unnecessary deprecation warnings (#12620)
    srikanthsrnvs committed 1 year ago
  • [V1][Minor] Avoid frequently creating ConstantList (#12653)
    srikanthsrnvs committed 1 year ago
  • [Core][v1] Unify allocating slots in prefill and decode in KV cache manager (#12608)
    srikanthsrnvs committed 1 year ago
  • [Hardware][Intel GPU] add XPU bf16 support (#12392)
    srikanthsrnvs committed 1 year ago
  • [Misc] Add SPDX-License-Identifier headers to python source files (#12628)
    srikanthsrnvs committed 1 year ago
  • [doc][misc] clarify VLLM_HOST_IP for multi-node inference (#12667)
    srikanthsrnvs committed 1 year ago
  • Merge branch 'main' into fix-moe-wna16-attention
    srikanthsrnvs committed 1 year ago
  • unused imports
    srikanthsrnvs committed 1 year ago