Foreach clamp_min clamp_max (#91384)
Adds `_foreach_clamp_min` and `_foreach_clamp_max` as binary ops, with scalar, scalar list, and tensor list support.
Timing example for `_foreach_clamp_min_` on an RTX 3070 Ti across lists of tensors with varying tensor count and size (times are in microseconds (us); a sketch of a measurement harness follows the tables):
CUDA:
```
[------------------ (tensors, scalar) -------------------]
                            |  for loop |   foreach
10 tensors of size 4        |      29.0 |      10.2
100 tensors of size 4       |     234.4 |      18.3
1000 tensors of size 4      |    2194.1 |     113.5
10000 tensors of size 4     |   21745.6 |    1144.5
10 tensors of size 16       |      29.5 |      12.0
100 tensors of size 16      |     256.9 |      19.9
1000 tensors of size 16     |    2499.7 |     123.6
10000 tensors of size 16    |   25022.2 |    1295.6
10 tensors of size 256      |      32.8 |      11.2
100 tensors of size 256     |     258.8 |      19.7
1000 tensors of size 256    |    2509.2 |     123.7
10000 tensors of size 256   |   25016.2 |    1295.4
10 tensors of size 65536    |      32.9 |      18.7
100 tensors of size 65536   |     327.1 |     150.3
1000 tensors of size 65536  |    3051.3 |    1388.0
10000 tensors of size 65536 |   30476.9 |   14021.5
[------------------ (tensors, tensors) ------------------]
                            |  for loop |   foreach
10 tensors of size 4        |      26.8 |      17.3
100 tensors of size 4       |     206.8 |      90.5
1000 tensors of size 4      |    1993.0 |     828.9
10000 tensors of size 4     |   19851.0 |    9063.3
10 tensors of size 16       |      34.7 |      20.0
100 tensors of size 16      |     232.2 |     102.1
1000 tensors of size 16     |    2220.9 |     977.3
10000 tensors of size 16    |   22644.5 |   10361.4
10 tensors of size 256      |      30.5 |      19.7
100 tensors of size 256     |     231.6 |     102.4
1000 tensors of size 256    |    2251.9 |     978.7
10000 tensors of size 256   |   22680.3 |   10405.8
10 tensors of size 65536    |      30.6 |      34.4
100 tensors of size 65536   |     315.1 |     223.6
1000 tensors of size 65536  |    3252.1 |    2114.4
10000 tensors of size 65536 |   30578.0 |   22826.3
```
CPU:
```
[------------------- (tensors, scalar) -------------------]
                            |  for loop |   foreach
10 tensors of size 4        |      13.0 |       9.6
100 tensors of size 4       |      62.4 |      31.6
1000 tensors of size 4      |     562.2 |     245.6
10000 tensors of size 4     |    5552.2 |    2517.7
10 tensors of size 16       |      14.9 |      11.3
100 tensors of size 16      |      74.1 |      36.9
1000 tensors of size 16     |     663.7 |     285.5
10000 tensors of size 16    |    6765.2 |    2947.5
10 tensors of size 256      |      15.2 |      11.8
100 tensors of size 256     |      76.0 |      37.7
1000 tensors of size 256    |     728.8 |     323.9
10000 tensors of size 256   |    7274.4 |    3800.3
10 tensors of size 65536    |     105.6 |     124.5
100 tensors of size 65536   |     982.8 |     939.7
1000 tensors of size 65536  |   14993.1 |   14579.2
10000 tensors of size 65536 |  163091.0 |  151555.8
[------------------- (tensors, tensors) ------------------]
                            |  for loop |   foreach
10 tensors of size 4        |      11.8 |      10.5
100 tensors of size 4       |      53.1 |      38.2
1000 tensors of size 4      |     465.1 |     316.1
10000 tensors of size 4     |    4616.9 |    3625.9
10 tensors of size 16       |      13.5 |      12.3
100 tensors of size 16      |      63.0 |      46.5
1000 tensors of size 16     |     560.1 |     359.9
10000 tensors of size 16    |    5586.8 |    3765.9
10 tensors of size 256      |      15.2 |      13.7
100 tensors of size 256     |      64.4 |      48.3
1000 tensors of size 256    |     653.7 |     410.0
10000 tensors of size 256   |    5916.6 |    3901.3
10 tensors of size 65536    |     109.1 |     106.8
100 tensors of size 65536   |    1128.9 |    1105.0
1000 tensors of size 65536  |   16245.0 |   15950.8
10000 tensors of size 65536 |  171111.3 |  163540.2
```
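For reference, this kind of comparison can be reproduced with `torch.utils.benchmark`; the snippet below is a rough sketch of such a harness, not the exact script used to produce the tables above:
```
# Rough sketch of a benchmark harness (illustrative only; not the exact
# script behind the tables above).
import torch
import torch.utils.benchmark as benchmark

def compare_clamp_min_(num_tensors, size, device):
    tensors = [torch.randn(size, device=device) for _ in range(num_tensors)]

    # Baseline: one clamp_min_ call per tensor in a Python loop.
    for_loop = benchmark.Timer(
        stmt="for t in ts: t.clamp_min_(0.1)",
        globals={"ts": [t.clone() for t in tensors]},
    ).timeit(100)

    # Fused: a single _foreach_clamp_min_ call over the whole list.
    foreach = benchmark.Timer(
        stmt="torch._foreach_clamp_min_(ts, 0.1)",
        globals={"torch": torch, "ts": [t.clone() for t in tensors]},
    ).timeit(100)

    print(f"{num_tensors} tensors of size {size}: "
          f"for loop {for_loop.mean * 1e6:.1f} us | "
          f"foreach {foreach.mean * 1e6:.1f} us")

for num_tensors in (10, 100, 1000, 10000):
    for size in (4, 16, 256, 65536):
        compare_clamp_min_(num_tensors, size, device="cuda")
```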
Example use:
```
import torch

tensors = [torch.randn(16, device='cuda') for _ in range(10)]

# Out-of-place: scalar, scalar list, and tensor list right-hand sides.
out = torch._foreach_clamp_min(tensors, 0.1)
out = torch._foreach_clamp_min(tensors, [0.1] * len(tensors))
out = torch._foreach_clamp_min(tensors, tensors)

# In-place variants.
torch._foreach_clamp_min_(tensors, 0.1)
torch._foreach_clamp_min_(tensors, [0.1] * len(tensors))
torch._foreach_clamp_min_(tensors, tensors)
```
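Each of these calls is the fused counterpart of clamping every tensor in the list individually (the "for loop" baseline in the tables above); a quick, illustrative sanity check of that equivalence:
```
import torch

tensors = [torch.randn(16, device='cuda') for _ in range(10)]

# The fused op should match a plain per-tensor loop.
fused = torch._foreach_clamp_min(tensors, 0.1)
looped = [torch.clamp_min(t, 0.1) for t in tensors]
assert all(torch.equal(a, b) for a, b in zip(fused, looped))
```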
Complex dtypes are not supported.
The existing `_foreach_minimum`/`_foreach_maximum` ops are changed to use this new implementation.
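An elementwise maximum against a bound is the same operation as `clamp_min`, and an elementwise minimum the same as `clamp_max` (for ordinary, non-NaN bounds), which is what makes this reuse possible; a small illustration:
```
import torch

x = torch.randn(8)
bound = torch.tensor(0.1)

# max(x, b) clamps from below; min(x, b) clamps from above.
assert torch.equal(torch.maximum(x, bound), torch.clamp_min(x, 0.1))
assert torch.equal(torch.minimum(x, bound), torch.clamp_max(x, 0.1))
```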
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91384
Approved by: https://github.com/ngimel