[FSDP] Dequeue one instead of flush (#86165)
For the rate limiter, I initially implemented the approach of dequeueing only a single event, but there was concern about blocking the CPU thread _every_ iteration. The landed approach instead blocks every `_max_num_inflight_all_gathers` iterations and flushes the entire queue.
However, upon further analysis, dequeueing a single event should be more performant while using the same memory -- as the name suggests, both approaches permit at most `_max_num_inflight_all_gathers` concurrently inflight all-gathers. What matters is not how often the CPU thread blocks but how long it stays blocked each time, and this PR's approach reduces that duration: it waits only for the oldest all-gather to finish rather than for the entire queue to drain.
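A minimal sketch may help contrast the two strategies; the class and method names here are illustrative, not FSDP's actual internals. Each all-gather records a CUDA event that is pushed onto a queue, and the rate limiter decides how much to drain before the next all-gather is issued:

```python
import collections

import torch


class _IllustrativeFreeEventQueue:
    """Hypothetical rate limiter for inflight all-gathers (names are
    for illustration only, not FSDP's actual internals)."""

    def __init__(self, max_num_inflight_all_gathers: int = 2) -> None:
        self._queue: collections.deque[torch.cuda.Event] = collections.deque()
        self._max_num_inflight = max_num_inflight_all_gathers

    def enqueue(self, event: torch.cuda.Event) -> None:
        # Called after each all-gather is issued, with an event recorded
        # in the all-gather stream.
        self._queue.append(event)

    def dequeue_one_if_needed(self) -> None:
        # This PR's approach: block only until the single oldest
        # all-gather finishes, keeping the pipeline as full as possible.
        if len(self._queue) >= self._max_num_inflight:
            event = self._queue.popleft()
            event.synchronize()  # blocks the CPU thread

    def flush_if_needed(self) -> None:
        # Previous approach: once the limit is hit, drain the whole
        # queue, blocking the CPU thread until every inflight
        # all-gather has finished.
        if len(self._queue) >= self._max_num_inflight:
            while self._queue:
                self._queue.popleft().synchronize()
```

Both variants bound the number of inflight all-gathers by `_max_num_inflight_all_gathers`, but the flush variant over-blocks: after draining, the CPU thread must reissue the full complement of all-gathers before communication can overlap computation again.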
**Fast Communication; Slow Computation**
<img width="1235" alt="Screen Shot 2022-10-04 at 4 15 13 PM" src="https://user-images.githubusercontent.com/31054793/193917536-f1491803-9578-45ea-ba6e-e735c1bf7784.png">
**Slow Communication; Fast Computation**
<img width="718" alt="Screen Shot 2022-10-04 at 4 34 15 PM" src="https://user-images.githubusercontent.com/31054793/193921508-f2a4fd22-2b03-4a8e-b6ca-634c584c70e2.png">
**T5-11B**
2 nodes / 16 × 40 GB A100s with EFA and batch size 6:
- [Old] 5.81 s / batch; 24 and 20 CUDA malloc retries on local rank 0s; 35.234 GB peak active; 38.806 GB peak reserved
- [New] 5.10 s / batch; 25 and 29 CUDA malloc retries on local rank 0s; 35.234 GB peak active; 38.868 GB peak reserved
4 nodes / 32 × 40 GB A100s with EFA and batch size 7:
- [Old] 5.21 s / batch; 0, 0, 0, 0 CUDA malloc retries on local rank 0s; 33.695 GB peak active; 38.494 GB peak reserved
- [New] 4.93 s / batch; 1, 0, 0, 0 CUDA malloc retries on local rank 0s; 33.678 GB peak active; 38.792 GB peak reserved
The new version changes the fragmentation pattern in the caching allocator. It is possible that, because the old approach blocks the CPU thread more, the initial blocks used to serve the all-gather stream allocations differ between the two approaches. Even though the number of CUDA malloc retries increases slightly, the net result is still a speedup with the new approach.
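For reference, the retry and peak-memory statistics reported above can be read from the CUDA caching allocator; a sketch of one way to collect them after a training step:

```python
import torch

# torch.cuda.memory_stats() exposes the caching allocator's counters,
# including the number of cudaMalloc retries (cache flush-and-retry
# events) and peak active/reserved bytes.
stats = torch.cuda.memory_stats()
print(f"CUDA malloc retries: {stats['num_alloc_retries']}")
print(f"peak active:   {stats['active_bytes.all.peak'] / 2**30:.3f} GB")
print(f"peak reserved: {stats['reserved_bytes.all.peak'] / 2**30:.3f} GB")
```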
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86165
Approved by: https://github.com/zhaojuanmao