transformers
e186123d - tp loading: remove dead code, shared thread pool, allocator warmup

Commit

79 days ago

tp loading: remove dead code, shared thread pool, allocator warmup

- Remove _batch_shard_for_scatter (dead fallback; the per-rank loop works uniformly for every TP class: Col/Row, Packed, Embedding, MoE, MLA)
- Remove _redistribute_async (dead function, never called)
- Reuse the existing thread_pool for per-batch disk reads instead of spinning up a second ThreadPoolExecutor (with an _ImmediateFuture fallback for sync-load mode)
- Plan all batch layouts up front (pure CPU metadata, free), then warm the caching allocator with one big torch.empty(peak_bytes) so the hot loop's torch.empty calls carve from the pool instead of hitting cudaMalloc each batch
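
A minimal sketch of the two ideas described in the commit message, written as an illustration rather than taken from the diff. All names here (ImmediateFuture, submit_read, peak_batch_bytes, fake_read) are hypothetical; only the general shape (reuse one pool, fall back to a synchronous future-like object, compute the peak batch size from metadata before allocating) follows the message.

```python
from concurrent.futures import ThreadPoolExecutor
from math import prod


class ImmediateFuture:
    """Hypothetical stand-in for a sync-load fallback: runs the call eagerly.

    Unlike a real Future, an exception is raised at construction time,
    not when result() is called.
    """

    def __init__(self, fn, *args):
        self._value = fn(*args)

    def result(self):
        return self._value


def submit_read(thread_pool, fn, *args):
    # Reuse the caller's existing pool when there is one; with no pool
    # (sync-load mode) run the read inline behind the same .result() API.
    if thread_pool is None:
        return ImmediateFuture(fn, *args)
    return thread_pool.submit(fn, *args)


def peak_batch_bytes(batches):
    # Each batch is a list of (shape, itemsize) pairs: pure CPU metadata.
    # The peak is the largest total byte count any single batch needs.
    return max(
        (sum(prod(shape) * itemsize for shape, itemsize in batch) for batch in batches),
        default=0,
    )


def fake_read(name):
    # Placeholder for a per-batch disk read.
    return f"bytes:{name}"


with ThreadPoolExecutor(max_workers=2) as pool:
    pooled = submit_read(pool, fake_read, "shard0").result()
sync = submit_read(None, fake_read, "shard0").result()

layouts = [
    [((4, 4), 2), ((8,), 4)],  # 32 + 32 bytes
    [((16, 16), 2)],           # 512 bytes
]
peak = peak_batch_bytes(layouts)
```

With a peak figure like this in hand, the warmup step in the message presumably amounts to allocating and releasing one buffer of that size (for example torch.empty(peak, dtype=torch.uint8, device="cuda") followed by dropping the reference), so that later per-batch allocations can be served from PyTorch's caching allocator. That reading is an assumption based on the message, not on the diff.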

References

#45453 - Draft commit

Author

ArthurZucker

Parents

63d7486e
