[FSDP] Add rate limiter (#83917)
**Overview**
This PR adds a `bool` argument `limit_all_gathers` to the FSDP constructor, defaulted to `False`.
- Setting `limit_all_gathers=True` limits the max number of inflight all-gathers to 2 (an empirically chosen constant), preventing a fast CPU thread from over-allocating blocks to the all-gather stream.
- When a workload hits a high number of CUDA malloc retries, enabling the limiter can reduce the retry count and thereby improve QPS.
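To illustrate the idea behind a count-based limit, here is a toy model in plain Python. It is not FSDP's implementation: the function name `peak_inflight` and the modeling assumption (the CPU issues all-gathers instantly, and an all-gather only completes when the CPU blocks on it) are hypothetical and chosen only to show how a cap bounds the number of simultaneously allocated all-gather buffers.

```python
from collections import deque


def peak_inflight(num_all_gathers, max_inflight=None):
    """Return the peak number of in-flight all-gathers in a toy model.

    Assumption (illustrative only): the CPU thread runs far ahead of the GPU,
    so an all-gather completes, and its buffer is freed, only when the CPU
    explicitly waits on it.
    """
    inflight = deque()  # ids of all-gathers whose buffers are still allocated
    peak = 0
    for i in range(num_all_gathers):
        if max_inflight is not None and len(inflight) >= max_inflight:
            # Limiter: block the CPU until the oldest all-gather finishes,
            # which frees its buffer before a new one is allocated.
            inflight.popleft()
        inflight.append(i)
        peak = max(peak, len(inflight))
    return peak


unlimited = peak_inflight(8)                # no limiter
limited = peak_inflight(8, max_inflight=2)  # limiter with a cap of 2
print(unlimited, limited)
```

In this model the unlimited case holds every buffer at once, while the capped case never holds more than two, which is the kind of over-allocation the limiter is meant to prevent.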
**Exploration**
I experimented with both a count-based limiter and size-based limiter (where the size is based on the inflight all-gather size in bytes).
- The size-based limiter did not provide any advantage and left both developer and user unsure what threshold to set.
- For the count-based approach, I decided not to expose the max number of inflight all-gathers to the user: values other than 2 did not show improvements, and exposing the knob may confuse users.
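The difference between the two approaches can be sketched as two admission checks. The function names, the `max_bytes` parameter, and the example sizes below are hypothetical and not taken from the PR; the sketch only shows why the size-based variant requires picking a byte threshold while the count-based variant does not.

```python
def can_issue_count_based(inflight_sizes, limit=2):
    """Admit a new all-gather if fewer than `limit` are currently in flight."""
    return len(inflight_sizes) < limit


def can_issue_size_based(inflight_sizes, next_size, max_bytes):
    """Admit a new all-gather if total in-flight bytes stay within `max_bytes`.

    `max_bytes` is the threshold someone would have to tune (hypothetical knob).
    """
    return sum(inflight_sizes) + next_size <= max_bytes


# Illustrative sizes in bytes for all-gathers that are already in flight.
inflight = [100, 200]
print(can_issue_count_based(inflight))                    # False: 2 already in flight
print(can_issue_size_based(inflight, 50, max_bytes=400))  # True: 350 <= 400
```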
**T5-11B**
The T5-11B results show the performance gain from enabling the limiter and indicate that a limit of 2 is a reasonable choice. The runs used an AWS cluster with 8 A100s per node and EFA. For both 2 and 4 nodes, we scaled the batch size maximally before hitting OOM, which is common practice.
<p float="left">
<img src="https://user-images.githubusercontent.com/31054793/188936036-04427da9-f492-4e50-9b35-ff64665d9815.png" width="400" />
<img src="https://user-images.githubusercontent.com/31054793/188936045-f44e659f-1e18-4ea7-8c78-0fce4ff8fb48.png" width="400" />
</p>
For 2 nodes, the limit of 2 yields a 3.01x QPS improvement, and for 4 nodes, it yields a 2.87x QPS improvement.
We need more data points, but the limiter may simplify the batch size scaling workflow. Normally, a practitioner may scale until hitting OOM and back off until there are few CUDA malloc retries. However, now the practitioner may be able to scale until hitting OOM and simply turn on the limiter to reduce the number of retries instead of backing off.
Differential Revision: [D39331201](https://our.internmc.facebook.com/intern/diff/D39331201)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83917
Approved by: https://github.com/zhaojuanmao