Migrate `_th_std_var` to ATen (#59258)
Summary:
Ref https://github.com/pytorch/pytorch/issues/49421
This migrates the special-case all-reduction path of `std`/`var` from TH to ATen. Using the benchmark from gh-43858, which was used to justify keeping the TH version, I find that this PR has similar single-threaded performance (slightly better for large inputs). Unlike the TH version, the ATen version is multi-threaded, so it is much faster for large tensors: with 8 threads, `torch_var` on 8000000 elements drops from 5729.7 to 793.5, roughly a 7x speedup.
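For readers unfamiliar with why a full reduction of `var` parallelizes well, here is a minimal pure-Python sketch. It is an illustration only and is not the kernel added by this PR: the names `chunk_stats`, `combine`, and `var_all` are hypothetical, and the merge step uses the standard pairwise (Chan et al.) update as an assumed technique. Each chunk is reduced independently to `(count, mean, M2)`, and the partial results are then merged.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from typing import Sequence, Tuple

# Partial result for one chunk: (count, mean, M2), where M2 is the
# sum of squared deviations from that chunk's mean.
Partial = Tuple[int, float, float]


def chunk_stats(chunk: Sequence[float]) -> Partial:
    """Reduce one chunk with two passes: the mean, then squared deviations."""
    n = len(chunk)
    if n == 0:
        return (0, 0.0, 0.0)
    mean = sum(chunk) / n
    m2 = sum((x - mean) ** 2 for x in chunk)
    return (n, mean, m2)


def combine(a: Partial, b: Partial) -> Partial:
    """Merge two partial results with the pairwise update formula."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return (0, 0.0, 0.0)
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return (n, mean, m2)


def var_all(data: Sequence[float], num_chunks: int = 8,
            unbiased: bool = True) -> float:
    """Variance over all elements, computed chunk by chunk (hypothetical helper)."""
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    size = max(1, -(-len(data) // num_chunks))  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # Threads only show the structure here; pure-Python work is not sped up
    # by threads the way a native multi-threaded kernel is.
    with ThreadPoolExecutor(max_workers=num_chunks) as pool:
        partials = list(pool.map(chunk_stats, chunks))
    total: Partial = (0, 0.0, 0.0)
    for partial in partials:
        total = combine(total, partial)
    n, _, m2 = total
    denom = n - 1 if unbiased else n
    return m2 / denom if denom > 0 else math.nan
```

Because the chunks are independent, the per-chunk work can be spread across threads. That is consistent with the 8-thread numbers below, which only pull ahead once the input is large enough to amortize the threading overhead.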
TH results:
```
[----------------------------- Index ------------------------------]
        | torch_var | torch_var0 |  stdfn | torch_sum0
1 threads: ---------------------------------------------------------
      8 |       3.6 |        3.8 |    8.2 |        1.2
     80 |       3.7 |        3.8 |    8.4 |        1.2
    800 |       4.2 |        4.3 |    8.7 |        1.2
   8000 |       9.0 |        9.1 |   11.2 |        1.5
  80000 |      58.3 |       59.0 |   30.6 |        4.2
 800000 |     546.9 |      546.9 |  183.4 |       31.3
8000000 |    5729.7 |     5701.0 | 6165.4 |      484.1
```
ATen results:
```
[----------------------------- Index ------------------------------]
        | torch_var | torch_var0 |  stdfn | torch_sum0
1 threads: ---------------------------------------------------------
      8 |       4.0 |        4.0 |    8.7 |        1.2
     80 |       3.6 |        3.8 |    9.0 |        1.2
    800 |       4.1 |        4.3 |    8.9 |        1.2
   8000 |       8.9 |        9.2 |   10.6 |        1.5
  80000 |      57.0 |       57.4 |   28.8 |        4.3
 800000 |     526.9 |      526.9 |  178.3 |       30.2
8000000 |    5568.1 |     5560.6 | 6042.1 |      453.2

[----------------------------- Index ------------------------------]
        | torch_var | torch_var0 |  stdfn | torch_sum0
8 threads: ---------------------------------------------------------
      8 |       3.9 |        3.8 |    9.1 |        1.2
     80 |       3.8 |        3.9 |    8.8 |        1.2
    800 |       4.2 |        4.3 |    8.9 |        1.3
   8000 |       9.0 |        9.2 |   10.4 |        1.5
  80000 |      26.0 |       26.8 |   26.4 |        4.4
 800000 |      92.9 |       87.3 |   72.1 |       22.4
8000000 |     793.5 |      791.8 | 5334.8 |      115.1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59258
Reviewed By: mruberry
Differential Revision: D28821216
Pulled By: ngimel
fbshipit-source-id: f35992c21f08a0a8878053680dc0ca7a8facd155