transformers
e186123d - tp loading: remove dead code, shared thread pool, allocator warmup

Commit
79 days ago
- Remove _batch_shard_for_scatter (dead fallback; the per-rank loop works uniformly for every TP class: Col/Row, Packed, Embedding, MoE, MLA)
- Remove _redistribute_async (dead function, never called)
- Reuse the existing thread_pool for per-batch disk reads instead of spinning up a second ThreadPoolExecutor (with an _ImmediateFuture fallback for sync-load mode)
- Plan all batch layouts up front (pure CPU metadata, free), then warm the caching allocator with one big torch.empty(peak_bytes) so the hot loop's torch.empty calls carve from the pool instead of hitting cudaMalloc each batch
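
The last two points can be illustrated with a small standard-library sketch. It is not the code from this commit: `ImmediateFuture`, `submit`, `peak_bytes`, and `fake_read` are hypothetical names, and the batch layout format (lists of `(shape, itemsize)` pairs) is an assumption made for illustration. It shows one way a single submit path could serve both the shared pool and sync-load mode, and how the peak batch size might be derived from metadata alone before any allocation happens.

```python
from concurrent.futures import ThreadPoolExecutor


class ImmediateFuture:
    """Hypothetical stand-in for the commit's _ImmediateFuture.

    Runs the call eagerly in the caller's thread and exposes result(),
    the only Future method this sketch relies on. Unlike a real Future,
    an exception is raised at construction time rather than in result().
    """

    def __init__(self, fn, *args):
        self._value = fn(*args)

    def result(self):
        return self._value


def submit(pool, fn, *args):
    # With a shared pool, reads can overlap with other work.
    # With pool=None (assumed to model sync-load mode), run inline.
    if pool is None:
        return ImmediateFuture(fn, *args)
    return pool.submit(fn, *args)


def peak_bytes(batches):
    # Each batch is a list of (shape, itemsize) pairs: pure CPU metadata,
    # so the largest batch can be found without touching any tensor data.
    def nbytes(shape, itemsize):
        total = itemsize
        for dim in shape:
            total *= dim
        return total

    return max(sum(nbytes(s, i) for s, i in batch) for batch in batches)


def fake_read(name):
    # Placeholder for a per-batch disk read.
    return f"tensor:{name}"


# Two planned batches: 4*8*2 + 8*2 = 80 bytes, and 16*16*4 = 1024 bytes.
batches = [[((4, 8), 2), ((8,), 2)], [((16, 16), 4)]]
peak = peak_bytes(batches)

# The same call site works with or without a pool.
sync_result = submit(None, fake_read, "a").result()
with ThreadPoolExecutor(max_workers=2) as pool:
    pooled_result = submit(pool, fake_read, "b").result()
```

The warmup step itself is left out because it needs a GPU. Per the commit message it is a single torch.empty(peak_bytes) issued before the hot loop. The sketch only covers the planning that would produce that number.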