[quant] Remove calls to .item() for fake_quant_on (#61921)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61921
For GPU training, the `fake_quant_on` tensors live on the GPU, so each `.item()` call incurs a GPU->CPU copy to read the tensor element.
These copies are expensive and hurt training performance: in the profile below, the `item()` and `_local_scalar_dense()` calls account for roughly 11% of total CPU execution time.
The fix is to read the flag directly on the GPU, without a copy.
Individual op benchmarks show a 33% speedup just from removing the `.item()` calls.
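To illustrate the idea (this is a hedged sketch, not the actual kernel code in this PR): instead of branching on `fake_quant_on.item()`, which forces a device-to-host sync, one can compute the fake-quantized result unconditionally and select between it and the input with `torch.where`, so the flag never leaves the device. The function name `fake_quant_update` and its signature are hypothetical.

```python
import torch

def fake_quant_update(x, fake_quant_on, scale, zero_point, qmin, qmax):
    # Branch-free alternative to `if fake_quant_on.item() == 1: ...`.
    # Everything stays on x's device; no GPU->CPU copy is triggered.
    q = torch.fake_quantize_per_tensor_affine(x, scale, zero_point, qmin, qmax)
    # fake_quant_on is a 0-dim tensor (1 = quantize, 0 = pass through);
    # a 0-dim condition broadcasts over x.
    return torch.where(fake_quant_on.bool(), q, x)

x = torch.randn(4)
y_on = fake_quant_update(x, torch.tensor(1), 0.1, 0, -128, 127)
y_off = fake_quant_update(x, torch.tensor(0), 0.1, 0, -128, 127)
```

With the flag off, the input passes through unchanged; with it on, the quantize-dequantize result is returned, and in neither case does the host read the flag.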
Profiler Before
```
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::fused_moving_avg_obs_fake_quant 5.61% 1.538ms 100.00% 27.421ms 548.425us 978.208us 3.42% 28.575ms 571.501us 50
aten::_fused_moving_avg_obs_fq_helper 27.63% 7.576ms 94.39% 25.883ms 517.668us 6.536ms 22.87% 27.597ms 551.937us 50
aten::_fake_quantize_per_tensor_affine_cachemask_ten... 11.07% 3.037ms 21.54% 5.905ms 118.103us 9.549ms 33.42% 9.549ms 190.978us 50
aten::_aminmax 19.39% 5.317ms 27.44% 7.524ms 150.484us 8.683ms 30.38% 8.683ms 173.651us 50
aten::item 4.49% 1.232ms 11.12% 3.051ms 61.011us 1.058ms 3.70% 2.829ms 56.579us 50
aten::_local_scalar_dense 6.63% 1.818ms 6.63% 1.818ms 36.363us 1.771ms 6.20% 1.771ms 35.419us 50
aten::empty 5.76% 1.579ms 5.76% 1.579ms 15.792us 0.000us 0.00% 0.000us 0.000us 100
aten::as_strided 2.29% 628.399us 2.29% 628.399us 6.284us 0.000us 0.00% 0.000us 0.000us 100
aten::empty_like 7.56% 2.073ms 17.13% 4.696ms 31.310us 0.000us 0.00% 0.000us 0.000us 150
aten::empty_strided 9.57% 2.623ms 9.57% 2.623ms 17.489us 0.000us 0.00% 0.000us 0.000us 150
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 27.421ms
Self CUDA time total: 28.575ms
```
After
```
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::fused_moving_avg_obs_fake_quant 6.59% 1.240ms 100.00% 18.820ms 376.396us 490.272us 2.36% 20.745ms 414.901us 50
aten::_fused_moving_avg_obs_fq_helper 26.12% 4.916ms 93.41% 17.580ms 351.597us 2.033ms 9.80% 20.255ms 405.096us 50
aten::_fake_quantize_per_tensor_affine_cachemask_ten... 14.55% 2.738ms 31.09% 5.850ms 117.005us 9.968ms 48.05% 9.968ms 199.363us 50
aten::_aminmax 25.28% 4.758ms 36.21% 6.814ms 136.278us 8.253ms 39.79% 8.253ms 165.069us 50
aten::empty 7.94% 1.494ms 7.94% 1.494ms 14.944us 0.000us 0.00% 0.000us 0.000us 100
aten::as_strided 2.99% 561.785us 2.99% 561.785us 5.618us 0.000us 0.00% 0.000us 0.000us 100
aten::empty_like 8.36% 1.573ms 16.53% 3.112ms 31.118us 0.000us 0.00% 0.000us 0.000us 100
aten::empty_strided 8.17% 1.538ms 8.17% 1.538ms 15.384us 0.000us 0.00% 0.000us 0.000us 100
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 18.820ms
Self CUDA time total: 20.745ms
```
Test Plan:
python test/test_quantization.py
Imported from OSS
Reviewed By: jingsh
Differential Revision: D29796533
fbshipit-source-id: 10abb93abd61c6ac25b8e8c114aa57b9db891918