vllm
e0f7ae54 - [Frontend] Add multi-server frontend for K8s pod health aggregation

Commit
65 days ago
[Frontend] Add multi-server frontend for K8s pod health aggregation

When running N vLLM API servers inside a single Kubernetes pod, a shared
SO_REUSEPORT setup means K8s health probes only reach one server. If any
backend crashes, the pod can remain partially live.

This PR adds --multi-server-frontend: a lightweight FastAPI process that
runs on the main port (K8s-facing) and:

1. Aggregates /health across all N backends — returns 200 only when every
   backend is healthy, so liveness/startup probes work correctly.
2. Monitors backend processes and exits with code 1 if any crash,
   triggering a K8s pod restart instead of leaving a degraded pod.

Port layout:
  --port       → frontend (K8s-facing)
  --port+1..+N → vLLM backend servers (pod-internal)

New files:
  vllm/entrypoints/openai/frontend.py
Modified:
  vllm/entrypoints/openai/cli_args.py (add --multi-server-frontend flag)
  vllm/entrypoints/cli/serve.py (add run_multi_api_server_with_frontend)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Author
Robert Shaw