Foreach gradient clipping (#91846)
Faster gradient clipping using the foreach functions. Benchmark results (CPU timings, lower is better):
```
[------------------------ (tensors, scalar) -------------------------]
                              | without foreach | with foreach |     apex
1 threads: ------------------------------------------------------------------
  10 tensors of size 4        |           120.5 |         61.1 |     50.3
  100 tensors of size 4       |           946.2 |        239.5 |    136.3
  1000 tensors of size 4      |          9808.5 |       2151.1 |   1006.9
  10000 tensors of size 4     |         96871.2 |      22637.4 |  10119.1
  10 tensors of size 16       |           121.0 |         64.1 |     52.5
  100 tensors of size 16      |           993.4 |        252.6 |    136.7
  1000 tensors of size 16     |          9427.7 |       2151.2 |   1049.5
  10000 tensors of size 16    |         97437.1 |      22203.1 |  10340.0
  10 tensors of size 256      |           118.9 |         62.3 |     51.5
  100 tensors of size 256     |           955.2 |        243.1 |    134.2
  1000 tensors of size 256    |          9374.9 |       2140.7 |   1009.6
  10000 tensors of size 256   |         95302.5 |      21849.4 |  10215.5
  10 tensors of size 65536    |           118.5 |         62.4 |     51.1
  100 tensors of size 65536   |          1740.7 |        243.3 |    225.3
  1000 tensors of size 65536  |         17364.1 |       2228.7 |   2004.5
  10000 tensors of size 65536 |        177510.1 |      25410.4 |  20678.2
```
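For reference, the math being sped up is standard L2-norm gradient clipping: compute each tensor's norm, combine them into a global norm, and scale every gradient by `max_norm / total_norm` when the budget is exceeded. The sketch below illustrates that algorithm in pure Python over flat lists of floats; it is not the PR's implementation, which batches the per-tensor norm and scaling steps into fused foreach calls over the whole tensor list instead of looping in Python.

```python
import math

def clip_grad_norm_(grads, max_norm, eps=1e-6):
    """Pure-Python sketch of L2 gradient clipping over a list of
    "tensors" (here: flat lists of floats), clipping in place.

    The foreach fast path does the same arithmetic, but replaces the
    per-tensor Python loop with batched ops on the whole list.
    """
    # Per-tensor L2 norms; the global norm is the norm of those norms.
    per_tensor = [math.sqrt(sum(x * x for x in g)) for g in grads]
    total_norm = math.sqrt(sum(n * n for n in per_tensor))
    # Only scale when the global norm exceeds the budget.
    clip_coef = max_norm / (total_norm + eps)
    if clip_coef < 1.0:
        for g in grads:
            for i in range(len(g)):
                g[i] *= clip_coef
    return total_norm

grads = [[3.0, 4.0], [0.0, 0.0]]   # global L2 norm = 5.0
total = clip_grad_norm_(grads, max_norm=1.0)
```

The per-tensor loop is where the speedup comes from: in recent PyTorch versions, `torch.nn.utils.clip_grad_norm_` exposes a `foreach=` argument to opt into the batched path benchmarked above.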
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91846
Approved by: https://github.com/janeyx99