vllm
fadfefcc - feat: lazy TP weight loading via SafetensorsSlice

Commit

33 days ago

feat: lazy TP weight loading via SafetensorsSlice

When tensor parallelism is active, each rank previously read the full weight tensor from disk and then called .narrow() to extract its shard, discarding the rest.

This change introduces SafetensorsSlice, a lazy wrapper around safetensors PySafeSlice that defers disk I/O until materialization. The .narrow() calls from weight loaders are recorded without touching the disk, and the actual read at materialize() time fetches only the needed sub-region.

The __torch_function__ protocol auto-materializes when the slice is used in any torch operation (e.g. param_data.copy_()), so no downstream weight loader code needs changes. Unsupported operations (.reshape(), .view(), .t(), etc.) also auto-materialize as a safety fallback.

Only affects the default "lazy" safetensors loading strategy.

Co-authored-by: Claude https://claude.ai/code/session_01Ngt6Nm9BtPKAEied3djkuE
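The record-then-read idea described in the commit message can be illustrated with a small, dependency-free sketch. This is not the SafetensorsSlice code from the commit: `FakeDiskTensor` is a hypothetical stand-in for safetensors' PySafeSlice that counts how many elements are "read", `LazySlice` is an illustrative name, and the `__torch_function__` auto-materialization hook is left out so the example runs without torch. It assumes the backing object can be indexed with one slice per dimension.

```python
class FakeDiskTensor:
    """Hypothetical stand-in for a PySafeSlice-like object: a 2-D tensor
    that only 'reads' the rows and columns requested through slicing."""

    def __init__(self, rows):
        self._rows = rows
        self.elements_read = 0  # simulated disk traffic, in elements

    def get_shape(self):
        return [len(self._rows), len(self._rows[0])]

    def __getitem__(self, index):
        row_slice, col_slice = index
        out = [row[col_slice] for row in self._rows[row_slice]]
        self.elements_read += sum(len(r) for r in out)
        return out


class LazySlice:
    """Illustrative lazy wrapper: narrow() only updates bookkeeping,
    and materialize() performs a single read of the final sub-region."""

    def __init__(self, source, windows=None):
        self._source = source
        # One (start, length) window per dimension; full extent by default.
        self._windows = windows or [(0, n) for n in source.get_shape()]

    @property
    def shape(self):
        return [length for _, length in self._windows]

    def narrow(self, dim, start, length):
        # Same argument order as torch.Tensor.narrow(dim, start, length).
        cur_start, cur_length = self._windows[dim]
        if start < 0 or length < 0 or start + length > cur_length:
            raise IndexError("narrow() out of range")
        windows = list(self._windows)
        # Offsets compose, so chained narrow() calls stay relative
        # to the previous window rather than to the full tensor.
        windows[dim] = (cur_start + start, length)
        return LazySlice(self._source, windows)

    def materialize(self):
        index = tuple(slice(s, s + n) for s, n in self._windows)
        return self._source[index]


# A 4x8 weight split column-wise across 4 ranks; this is rank 1.
full = FakeDiskTensor([[r * 10 + c for c in range(8)] for r in range(4)])
tp_size, tp_rank = 4, 1
shard_width = full.get_shape()[1] // tp_size
shard = LazySlice(full).narrow(1, tp_rank * shard_width, shard_width)
reads_before = full.elements_read   # nothing has been read yet
shard_data = shard.materialize()    # reads only this rank's columns
reads_after = full.elements_read
```

In this sketch the rank touches 8 of the 32 elements, instead of loading all 32 and discarding three quarters of them. A wrapper that also implements `__torch_function__`, as the commit describes, would call the equivalent of `materialize()` automatically the first time the slice is passed to a torch operation.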

References

claude/optimize-weight-loading-7FlLd

Author

claude

Parents

09e4576f
