feat: lazy TP weight loading via SafetensorsSlice

When tensor parallelism is active, each rank previously read the full
weight tensor from disk and then called .narrow() to extract its shard,
discarding the rest. This change introduces SafetensorsSlice, a lazy
wrapper around the safetensors PySafeSlice that defers disk I/O until
materialization. The .narrow() calls made by weight loaders are recorded
without touching the disk, and the read at materialize() time fetches
only the requested sub-region.
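
The record-then-read idea can be sketched as follows. This is a minimal
illustration, not the code in this change: LazySlice and FakeDiskSlice are
hypothetical names, and FakeDiskSlice is a pure-Python stand-in that only
mimics the get_shape() and slice-indexing surface of a PySafeSlice while
counting how many elements are actually read.

```python
class FakeDiskSlice:
    """Hypothetical stand-in for a PySafeSlice: shape query plus slice indexing."""

    def __init__(self, rows):
        self._rows = rows
        self.elements_read = 0  # counts elements fetched from "disk"

    def get_shape(self):
        return [len(self._rows), len(self._rows[0])]

    def __getitem__(self, index):
        row_slice, col_slice = index
        out = [row[col_slice] for row in self._rows[row_slice]]
        self.elements_read += sum(len(r) for r in out)
        return out


class LazySlice:
    """Records narrow() calls as per-dimension ranges; reads only on materialize()."""

    def __init__(self, backing, ranges=None):
        self._backing = backing
        shape = backing.get_shape()
        self._ranges = ranges if ranges is not None else [(0, n) for n in shape]

    @property
    def shape(self):
        return tuple(hi - lo for lo, hi in self._ranges)

    def narrow(self, dim, start, length):
        lo, hi = self._ranges[dim]
        if start < 0 or length < 0 or lo + start + length > hi:
            raise IndexError("narrow() range out of bounds")
        ranges = list(self._ranges)
        ranges[dim] = (lo + start, lo + start + length)
        return LazySlice(self._backing, ranges)  # no I/O happens here

    def materialize(self):
        index = tuple(slice(lo, hi) for lo, hi in self._ranges)
        return self._backing[index]  # single read of the recorded sub-region


# A 4x8 "weight" sharded column-wise across 4 ranks; this is rank 1.
disk = FakeDiskSlice([[r * 8 + c for c in range(8)] for r in range(4)])
tp_rank, tp_size = 1, 4
shard_width = 8 // tp_size
shard = LazySlice(disk).narrow(1, tp_rank * shard_width, shard_width)
assert disk.elements_read == 0  # narrow() did not touch the backing store
data = shard.materialize()
```

In this sketch the rank reads 8 of the 32 elements, where the eager path
would have read all 32 before narrowing.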

Through the __torch_function__ protocol, the slice auto-materializes when
it is used in any torch operation (e.g. param_data.copy_()), so no
downstream weight loader code needs changes. Unsupported operations
(.reshape(), .view(), .t(), etc.) also auto-materialize as a safety
fallback.
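
A rough sketch of how __torch_function__ can trigger materialization,
assuming PyTorch is installed. LazyTensor and its loader callback are
hypothetical names used only for illustration; the actual SafetensorsSlice
may be structured differently.

```python
import torch


class LazyTensor:
    """Hypothetical lazy wrapper; `loader` stands in for the deferred disk read."""

    def __init__(self, loader):
        self._loader = loader
        self.materialized = False

    def materialize(self):
        self.materialized = True
        return self._loader()

    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        # torch calls this hook whenever a LazyTensor appears among the
        # arguments of a torch function or Tensor method.
        kwargs = kwargs or {}

        def unwrap(x):
            return x.materialize() if isinstance(x, LazyTensor) else x

        args = tuple(unwrap(a) for a in args)
        kwargs = {k: unwrap(v) for k, v in kwargs.items()}
        return func(*args, **kwargs)  # re-run the op on real tensors


lazy = LazyTensor(lambda: torch.arange(6, dtype=torch.float32).reshape(2, 3))
param_data = torch.zeros(2, 3)
assert not lazy.materialized  # nothing has been loaded yet
param_data.copy_(lazy)        # dispatches to __torch_function__, which loads
```

The caller passes the lazy object exactly as it would pass a tensor, which
is why existing weight loaders need no changes under this approach.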

This change only affects the default "lazy" safetensors loading strategy.

Co-authored-by: Claude
https://claude.ai/code/session_01Ngt6Nm9BtPKAEied3djkuE