vllm
fadfefcc - feat: lazy TP weight loading via SafetensorsSlice

Commit
33 days ago
feat: lazy TP weight loading via SafetensorsSlice

When tensor parallelism is active, each rank previously read the full weight tensor from disk and then called .narrow() to extract its shard, discarding the rest.

This change introduces SafetensorsSlice, a lazy wrapper around safetensors PySafeSlice that defers disk I/O until materialization. The .narrow() calls from weight loaders are recorded without touching the disk, and the actual read at materialize() time fetches only the needed sub-region.

The __torch_function__ protocol auto-materializes when the slice is used in any torch operation (e.g. param_data.copy_()), so no downstream weight loader code needs changes. Unsupported operations (.reshape(), .view(), .t(), etc.) also auto-materialize as a safety fallback.

Only affects the default "lazy" safetensors loading strategy.

Co-authored-by: Claude
https://claude.ai/code/session_01Ngt6Nm9BtPKAEied3djkuE
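The record-then-read idea described in the message can be illustrated with a small standard-library sketch. This is not the SafetensorsSlice code from the commit: `FakeDiskTensor` and `LazySlice` are hypothetical names, the "disk" is an in-memory list that counts how many elements are read, and the `__torch_function__` auto-materialization is left out so the example runs without torch or safetensors installed.

```python
class FakeDiskTensor:
    """Hypothetical stand-in for a 2-D tensor stored on disk.

    It counts how many elements each read touches, so the saving
    from lazy narrowing is observable.
    """

    def __init__(self, rows, cols):
        self.shape = (rows, cols)
        self._data = [[r * cols + c for c in range(cols)] for r in range(rows)]
        self.elements_read = 0

    def read(self, row_range, col_range):
        r0, r1 = row_range
        c0, c1 = col_range
        self.elements_read += (r1 - r0) * (c1 - c0)
        return [row[c0:c1] for row in self._data[r0:r1]]


class LazySlice:
    """Hypothetical lazy wrapper: narrow() only records bounds."""

    def __init__(self, source, bounds=None):
        self._source = source
        # One (start, stop) pair per dimension; defaults to the full tensor.
        self._bounds = list(bounds) if bounds else [(0, n) for n in source.shape]

    @property
    def shape(self):
        return tuple(stop - start for start, stop in self._bounds)

    def narrow(self, dim, start, length):
        # Record the sub-range relative to the current bounds; no I/O happens.
        lo, hi = self._bounds[dim]
        if start < 0 or length < 0 or lo + start + length > hi:
            raise IndexError("narrow() range is out of bounds")
        new_bounds = list(self._bounds)
        new_bounds[dim] = (lo + start, lo + start + length)
        return LazySlice(self._source, new_bounds)

    def materialize(self):
        # The only place that touches the backing store.
        return self._source.read(*self._bounds)


# Assumed setup: an 8x4 weight sharded along dim 0 across 4 ranks.
tp_size, rank = 4, 1
disk = FakeDiskTensor(8, 4)
shard_rows = disk.shape[0] // tp_size

lazy_shard = LazySlice(disk).narrow(0, rank * shard_rows, shard_rows)
reads_before = disk.elements_read   # still 0: narrow() did no I/O
shard = lazy_shard.materialize()    # reads only rows 2 and 3
```

In this sketch rank 1 reads 8 of the 32 stored elements, where the eager read-then-narrow approach would read all 32 and discard three quarters of them.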