tp loading: remove dead code, share thread pool, warm allocator
- Remove _batch_shard_for_scatter, a dead fallback: the per-rank loop
handles every TP class uniformly (Col/Row, Packed, Embedding, MoE, MLA)
- Remove _redistribute_async (dead function, never called)
- Reuse the existing thread_pool for per-batch disk reads instead of
spinning up a second ThreadPoolExecutor; an _ImmediateFuture fallback
covers sync-load mode
- Plan all batch layouts up front (pure CPU metadata, so effectively
free), then warm the caching allocator with a single
torch.empty(peak_bytes) allocation so the hot loop's torch.empty calls
carve from the pool instead of calling cudaMalloc on every batch
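
For reviewers, a minimal sketch of the sync-load fallback idea. This is not
the code in this change: it assumes sync-load mode is signalled by passing
None for the pool, and submit_read and read_shard are hypothetical names
used only for illustration. The point is that callers can call .result()
the same way whether or not a pool exists.

```python
from concurrent.futures import ThreadPoolExecutor


class _ImmediateFuture:
    """Future-like wrapper whose work already ran on the calling thread.

    Illustrative only; the real class in this change may differ.
    """

    def __init__(self, fn, *args):
        self._value = None
        self._exc = None
        try:
            self._value = fn(*args)
        except Exception as exc:
            # Deferred so the error surfaces from result(), like a real Future.
            self._exc = exc

    def result(self, timeout=None):
        if self._exc is not None:
            raise self._exc
        return self._value


def submit_read(thread_pool, fn, *args):
    """Hypothetical helper: use the shared pool if present, else run inline.

    Assumption: sync-load mode is represented by thread_pool being None.
    """
    if thread_pool is None:
        return _ImmediateFuture(fn, *args)
    return thread_pool.submit(fn, *args)


def read_shard(name):
    # Stand-in for a per-batch disk read; returns fake bytes.
    return name.encode()


names = ["a", "b", "c"]

# Sync-load mode: no pool, each read runs immediately on this thread.
sync_results = [submit_read(None, read_shard, n).result() for n in names]

# Async mode: the same call sites reuse one shared pool.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [submit_read(pool, read_shard, n) for n in names]
    pooled_results = [f.result() for f in futures]
```

Both paths return the same results in submission order, so the loading
loop needs no branch on the load mode.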