Upgrade vLLM from 0.10.1.1 to 0.14.1 (#1173)
* Upgrade vLLM from 0.10.1.1 to 0.14.1
- Update pyproject.toml to vllm>=0.11.0
- Fix deprecated import: vllm.transformers_utils.tokenizer -> vllm.tokenizers
- Add comprehensive test suite for V1 engine compatibility
- Add smoke tests for quick validation
Changes:
- pyproject.toml: Updated vllm version constraint
- vllm_model.py: Updated get_tokenizer import path
- llm_as_judge.py: Updated get_tokenizer import path
- Added smoke_test_vllm_v11.py: Quick validation tests
- Added test_vllm_v1_compatibility.py: Comprehensive compatibility tests
All tests passing - V1 engine compatible, basic inference working.
* Fix vLLM slow test OOM by reducing GPU memory utilization and improving cleanup
The vLLM slow tests were failing with OOM errors when running after
accelerate tests. The issue was:
1. vLLM V1 engine requires a specific amount of free GPU memory at startup
2. After accelerate tests, only 5.89 GiB was free (out of 14.74 GiB)
3. vLLM with gpu_memory_utilization=0.6 wanted 8.84 GiB
Fixes:
- Reduce gpu_memory_utilization from 0.6 to 0.35 in test config (needs 5.16 GiB)
- Add GPU memory cleanup fixture in conftest.py that runs before/after slow tests
- Improve AsyncVLLMModel.cleanup() to properly delete model object
The gpu_memory_utilization parameter only affects KV cache allocation and
does not impact model outputs with temperature=0.0, so this change is safe.
* Fix vLLM CI test by increasing gpu_memory_utilization to 0.4
The CI test was failing with 'ValueError: To serve at least one request
with the model's max seq len (8192), 1.5 GiB KV cache is needed, which
is larger than the available KV cache memory (1.42 GiB).'
Root cause:
- Tesla T4 GPU (15.36 GB) in CI environment
- With gpu_memory_utilization=0.35, only 1.42 GiB available for KV cache
- Required 1.5 GiB for max_seq_len=8192
- Shortfall: 80 MB
Fix:
- Increase gpu_memory_utilization from 0.35 to 0.4
- Now provides ~1.62 GiB for KV cache (sufficient for 1.5 GiB requirement)
- Does not affect model outputs with temperature=0.0 (deterministic)
* Fix vLLM CI test and add GPU memory monitoring
This commit addresses two issues:
1. Fix vLLM engine initialization failure in CI
- Root cause: Triton library requires Python.h headers to compile CUDA utilities
- Solution: Install python3.10-dev package in CI workflow
- Error was: 'fatal error: Python.h: No such file or directory'
2. Add comprehensive GPU memory monitoring for slow tests
- Add _log_gpu_memory() helper function in conftest.py
- Log GPU memory before/after each slow test (device, total, allocated, reserved, free)
- Add memory logging to model cleanup methods:
* VLLMModel.cleanup()
* AsyncVLLMModel.cleanup()
* TransformersModel.cleanup()
- Shows memory freed during cleanup operations
This will help diagnose OOM issues and verify proper memory cleanup between tests.
Changes:
- .github/workflows/slow_tests.yaml: Add python3.10-dev installation step
- tests/conftest.py: Add GPU memory monitoring helper + enhanced fixture
- src/lighteval/models/vllm/vllm_model.py: Add memory logging to cleanup methods
- src/lighteval/models/transformers/transformers_model.py: Add memory logging to cleanup
* Fix vLLM CI: Add CUDA environment setup for FlashInfer JIT compilation
The vLLM test was failing because FlashInfer needs nvcc (CUDA compiler)
for JIT kernel compilation during warmup. The error was:
'RuntimeError: Could not find nvcc and default cuda_home="/usr/local/cuda" doesn't exist'
Fixes:
- Set CUDA_HOME=/usr/local/cuda-12.4 environment variable
- Add /usr/local/cuda-12.4/bin to PATH for nvcc access
- This allows FlashInfer to JIT-compile custom attention kernels
Previous fixes in this PR:
- ✅ Installed python3.10-dev for Python.h headers (Triton compilation)
- ✅ Increased gpu_memory_utilization from 0.35 to 0.4 for KV cache
- ✅ Added comprehensive GPU memory monitoring
GPU memory stats show plenty of free memory (14.71 GiB of 14.74 GiB),
so the issue is purely build-time tooling for JIT compilation.
* Fix vLLM CI: Pass CUDA environment variables to test subprocess
The vLLM v1 engine spawns subprocesses that don't inherit environment
variables set in . The previous fix set CUDA_HOME in the
GitHub Actions environment, but the vLLM EngineCore subprocess couldn't
access it, causing:
'/bin/sh: 1: /usr/local/cuda-12.4/bin/nvcc: not found'
Fix:
- Set CUDA_HOME and PATH directly in the test run command
- This ensures the environment variables are inherited by all subprocesses
- Now nvcc will be found during FlashInfer JIT compilation
The issue was subprocess environment isolation, not the parent environment.
* Install CUDA Toolkit 12.8 in CI for vLLM FlashInfer JIT compilation
- Add CUDA Toolkit 12.8 installation step to match nvidia-cuda-runtime-cu12==12.8.90
- Cache /usr/local/cuda-12.8 to speed up subsequent CI runs
- Add verification step to check nvcc availability
- Update CUDA_HOME and PATH to use CUDA 12.8
- Use export in test run to ensure subprocess inherits environment variables
This fixes the issue where vLLM v0.15.x with FlashInfer backend requires
nvcc at runtime for JIT compilation of CUDA kernels on Tesla T4 (SM 7.5).
Resolves: /bin/sh: 1: /usr/local/cuda-12.4/bin/nvcc: not found
* Fix vLLM v0.15.x API compatibility: use max_model_len instead of max_seq_len_to_capture
- Replace model.llm_engine.model_config.max_seq_len_to_capture with max_model_len
- Replace model.model_config.max_seq_len_to_capture with max_model_len for async model
- This attribute was renamed in vLLM v0.15.x
Fixes: AttributeError: 'ModelConfig' object has no attribute 'max_seq_len_to_capture'
* Fix vLLM v0.15.x generate() API: use prompts parameter instead of prompt_token_ids
- Replace prompt_token_ids= with prompts= in LLM.generate() calls
- Update both VLLMModel and AsyncVLLMModel
- Update llm_as_judge.py for VLLM backend
In vLLM v0.15.x, the LLM.generate() method signature changed:
- Old: generate(prompt_token_ids=..., sampling_params=...)
- New: generate(prompts=..., sampling_params=...)
Fixes: TypeError: LLM.generate() got an unexpected keyword argument 'prompt_token_ids'
* Fix vLLM v0.15.x prompt_logprobs API: increase top-k and handle dict structure
In vLLM v0.15.x, the prompt_logprobs structure changed:
- Now returns dict[int, Logprob] at each position (FlatLogprobs class)
- Only contains top-k tokens (default was 1, causing KeyError for continuation tokens)
- Need to access logprobs_at_position[token] instead of direct dict access
Changes:
1. Increase prompt_logprobs from 1 to 20 to ensure continuation tokens are included
2. Add defensive error handling with helpful message if token not found
3. Update variable names for clarity (logprobs -> logprobs_at_position)
* Fix vLLM v0.15.x logprobs API compatibility
* working omg
* revert
* revert
* revert
* Fix slow_tests workflow: update Python dev headers from 3.10 to 3.12
The GitHub Actions runner uses Python 3.12.3, so installing python3.10-dev
fails with 'Unable to locate package'. This updates the workflow to install
python3.12-dev to match the runner's Python version.
* lower memory need
* add debug prints
* upgrade ruff
* upgrade ruff
* fix dependencies
* fix dependencies