Use `fastAtomicAdd` in EmbeddingBag (mode "max") backward (#63298)
Summary:
Related issue: https://github.com/pytorch/pytorch/issues/62695

This switches the gradient accumulation in the EmbeddingBag max-mode backward CUDA kernel to `fastAtomicAdd`. Backward timings with and without the change are listed below.
### This PR
| n_tokens | num_embeddings | embedding_dim | mode | bwd_fp32 | bwd_fp16 |
|-----------:|-----------------:|----------------:|:-------|------------:|------------:|
| 4096 | 4096 | 4096 | max | 0.000326228 | 0.000181448 |
| 4096 | 4096 | 16384 | max | 0.00102805 | 0.000618136 |
| 4096 | 16384 | 4096 | max | 0.000907326 | 0.000530422 |
| 4096 | 16384 | 16384 | max | 0.00334988 | 0.00264645 |
| 16384 | 4096 | 4096 | max | 0.000366449 | 0.000320232 |
| 16384 | 4096 | 16384 | max | 0.00126421 | 0.00104183 |
| 16384 | 16384 | 4096 | max | 0.00087738 | 0.00065068 |
| 16384 | 16384 | 16384 | max | 0.00379229 | 0.00298201 |
### Original
| n_tokens | num_embeddings | embedding_dim | mode | bwd_fp32 | bwd_fp16 |
|-----------:|-----------------:|----------------:|:-------|------------:|------------:|
| 4096 | 4096 | 4096 | max | 0.00032407 | 0.000188231 |
| 4096 | 4096 | 16384 | max | 0.00104356 | 0.000624001 |
| 4096 | 16384 | 4096 | max | 0.000902069 | 0.000527382 |
| 4096 | 16384 | 16384 | max | 0.00302202 | 0.00255153 |
| 16384 | 4096 | 4096 | max | 0.000384343 | 0.000403249 |
| 16384 | 4096 | 16384 | max | 0.00126445 | 0.00135069 |
| 16384 | 16384 | 4096 | max | 0.000880814 | 0.000825679 |
| 16384 | 16384 | 16384 | max | 0.00337611 | 0.00319515 |
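For context, the kernel touched here is the EmbeddingBag max-mode backward scatter (presumably in `aten/src/ATen/native/cuda/EmbeddingBag.cu`). The sketch below is illustrative only: the kernel name, launch shape, and simplifications around empty bags are assumptions, not the actual implementation. It shows the core idea: each output-gradient element is atomically accumulated into `grad_weight` at the row recorded by the forward pass's argmax, and the accumulation call becomes `fastAtomicAdd` so that aligned half-precision adds can be vectorized.

```cuda
// Minimal sketch (not the verbatim ATen kernel): scatter each output-gradient
// element into grad_weight at the embedding row recorded by the forward argmax.
// fastAtomicAdd comes from ATen/native/cuda/KernelUtils.cuh; for fp16 it can pack
// two aligned adds into one __half2 atomic, otherwise it falls back to gpuAtomicAdd.
#include <ATen/native/cuda/KernelUtils.cuh>

template <typename scalar_t, typename index_t>
__global__ void embedding_bag_max_backward_sketch(
    scalar_t* grad_weight,        // [num_embeddings, embedding_dim]
    const scalar_t* grad_output,  // [num_bags, embedding_dim]
    const index_t* max_indices,   // [num_bags, embedding_dim], argmax row per feature
    int64_t num_bags,
    int64_t embedding_dim,
    int64_t grad_weight_numel) {
  const int64_t i = blockIdx.x * (int64_t)blockDim.x + threadIdx.x;
  if (i >= num_bags * embedding_dim) {
    return;
  }
  const int64_t feature = i % embedding_dim;
  const index_t row = max_indices[i];  // embedding row that produced the max
  // Before this change the accumulation would be a plain per-element atomic add;
  // fastAtomicAdd vectorizes the fp16 case when the address is suitably aligned.
  at::native::fastAtomicAdd(
      grad_weight,
      static_cast<index_t>(row * embedding_dim + feature),
      static_cast<index_t>(grad_weight_numel),
      grad_output[i],
      /*fast_atomics=*/true);
}
```

Since `fastAtomicAdd` falls back to the plain atomic for fp32, the benefit is expected on the fp16 path, which matches the tables above: the fp16 backward times improve most visibly at n_tokens = 16384.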
cc xwang233 ptrblck ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63298
Reviewed By: mruberry
Differential Revision: D30383583
Pulled By: ngimel
fbshipit-source-id: 14dd9d67002c53a153721812709033c198f68c1e