Parallelize cpu index_put accumulate float path with cpu_atomic_add_float (#29705)
Summary:
This PR parallelizes the index_put accumulate path for the float type on CPU. cpu_atomic_add_float is implemented using the atomic_compare_exchange_strong function.
For the [DLRM](https://github.com/facebookresearch/dlrm) benchmark, the time spent in _index_put_impl_ is reduced from 827.741ms to 116.646ms for 1000 batches.
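The compare-and-swap approach mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from this PR: the names `float_to_bits`, `bits_to_float`, `atomic_add_float_sketch` and `parallel_accumulate_demo` are hypothetical, and the sketch operates on a `std::atomic<uint32_t>` slot rather than on a raw `float*` inside a tensor.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical helpers: move between a float and its 32-bit pattern.
inline uint32_t float_to_bits(float f) {
  uint32_t u;
  std::memcpy(&u, &f, sizeof(u));
  return u;
}

inline float bits_to_float(uint32_t u) {
  float f;
  std::memcpy(&f, &u, sizeof(f));
  return f;
}

// Sketch of a CAS-based float add: read the current bits, compute the sum,
// and retry until no other thread has modified the slot in between.
inline void atomic_add_float_sketch(std::atomic<uint32_t>* dst, float value) {
  uint32_t expected = dst->load();
  uint32_t desired = float_to_bits(bits_to_float(expected) + value);
  // On failure, the current contents of *dst are written into `expected`,
  // so the sum is recomputed from the fresh value before retrying.
  while (!std::atomic_compare_exchange_strong(dst, &expected, desired)) {
    desired = float_to_bits(bits_to_float(expected) + value);
  }
}

// Demo (hypothetical): several threads accumulate 1.0f into the same slot,
// which mimics index_put with accumulate=true hitting a duplicated index.
inline float parallel_accumulate_demo(int num_threads, int adds_per_thread) {
  assert(num_threads > 0 && adds_per_thread >= 0);
  std::atomic<uint32_t> slot(float_to_bits(0.0f));
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([&slot, adds_per_thread] {
      for (int i = 0; i < adds_per_thread; ++i) {
        atomic_add_float_sketch(&slot, 1.0f);
      }
    });
  }
  for (auto& w : workers) {
    w.join();
  }
  return bits_to_float(slot.load());
}
```

Because every update goes through the retry loop, no addition is lost when several threads target the same element. The order of the additions is still nondeterministic, so results for general float inputs may differ in the last bits between runs.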
Add a "grain_size" parameter to TensorIterator::for_each to fine-tune index_put performance.
The default value of grain_size is internal::GRAIN_SIZE. The index_put grain size is tuned to 3000 and the cpu_kernel_vec grain size is tuned to 1024. The following table shows the impact of grain size on the DLRM ops
(the _index_put_impl_ numbers are measured with index_put already parallelized with cpu_atomic_add_float):
| Op Name | without small grain_size | with 1024 as grain_size in cpu_kernel_vec and 3000 in cpu_index_kernel |
|-----------------|----------:|----------:|
| add_ | 11.985s | 11.601s |
| mm | 9.706s | 9.518s |
| addmm | 5.380s | 5.247s |
| _embedding_bag | 2.992s | 2.663s |
| _embedding_bag_backward | 1.330s | 1.354s |
| threshold_backward | 686.920ms | 659.169ms |
| _index_put_impl_ | 489.411ms | 116.646ms |
| bmm | 413.129ms | 362.967ms |
| zero_ | 379.659ms | 310.623ms |
| add | 205.904ms | 171.111ms |
| cat | 187.101ms | 175.621ms |
| Self CPU time total (s) | 36.544 | 34.742 |
| Average ms per iteration | 38.25 | 36.44 |
For more on the reasons for grain size tuning, see [PR#30803](https://github.com/pytorch/pytorch/issues/30803).
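To illustrate why the grain size matters, here is a rough sketch of how a grain size can control the split of a loop into parallel tasks. It is a simplified model and not ATen's implementation: `num_tasks_sketch`, `parallel_for_sketch` and `sum_range_demo` are hypothetical names, and the exact splitting rule is an assumption made for illustration.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical rule: ranges shorter than grain_size run as a single task;
// otherwise use one task per grain, capped at max_threads.
inline int64_t num_tasks_sketch(int64_t n, int64_t grain_size, int64_t max_threads) {
  if (n <= 0) return 0;
  grain_size = std::max<int64_t>(grain_size, 1);
  max_threads = std::max<int64_t>(max_threads, 1);
  if (n < grain_size) return 1;
  const int64_t by_grain = (n + grain_size - 1) / grain_size;
  return std::min(by_grain, max_threads);
}

// Run body(begin, end) over [0, n), split into contiguous chunks.
inline void parallel_for_sketch(int64_t n, int64_t grain_size, int64_t max_threads,
                                const std::function<void(int64_t, int64_t)>& body) {
  assert(static_cast<bool>(body));
  const int64_t tasks = num_tasks_sketch(n, grain_size, max_threads);
  if (tasks == 0) return;
  if (tasks == 1) {  // too little work to be worth spawning threads
    body(0, n);
    return;
  }
  const int64_t chunk = (n + tasks - 1) / tasks;
  std::vector<std::thread> workers;
  for (int64_t begin = 0; begin < n; begin += chunk) {
    const int64_t end = std::min(begin + chunk, n);
    workers.emplace_back([&body, begin, end] { body(begin, end); });
  }
  for (auto& w : workers) {
    w.join();
  }
}

// Demo (hypothetical): sum the integers in [0, n) using the sketch above.
inline int64_t sum_range_demo(int64_t n, int64_t grain_size, int64_t max_threads) {
  std::atomic<int64_t> total(0);
  parallel_for_sketch(n, grain_size, max_threads, [&total](int64_t b, int64_t e) {
    int64_t local = 0;
    for (int64_t i = b; i < e; ++i) local += i;
    total += local;
  });
  return total.load();
}
```

Under this model, a 2000-element range with a large default grain size (32768 is used here purely as an assumed example value) stays on one thread, while a grain size of 1024 splits it into two tasks. This is the kind of effect the tuned values 3000 and 1024 are meant to achieve for the small tensors in DLRM.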
To reproduce the DLRM performance reported here, please also see
[PR#23057](https://github.com/pytorch/pytorch/pull/23057), [PR#24385](https://github.com/pytorch/pytorch/pull/24385) and [PR#27804](https://github.com/pytorch/pytorch/pull/27804),
and export the environment variables below:
```
export LD_PRELOAD=$HOME/anaconda3/lib/libjemalloc.so  # requires: conda install jemalloc
export KMP_BLOCKTIME=1
export KMP_AFFINITY="granularity=fine,compact,1,0"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29705
Differential Revision: D19777742
Pulled By: VitalyFedyunin
fbshipit-source-id: a8222fe6089b6bf56b674e35f790508ad05385c0