9d20d6d5 - Foreach clamp_min clamp_max (#91384)

Adds `_foreach_clamp_min` and `_foreach_clamp_max` as binary ops, with scalar, scalarlist and tensorlist support.

Timing example for `_foreach_clamp_min_` on an RTX 3070 Ti, across lists of tensors with varying tensor count and size (times are in microseconds (us)):

CUDA:
```
[------------------ (tensors, scalar) -------------------]
                              |  for loop  |  foreach
10 tensors of size 4          |      29.0  |      10.2
100 tensors of size 4         |     234.4  |      18.3
1000 tensors of size 4        |    2194.1  |     113.5
10000 tensors of size 4       |   21745.6  |    1144.5
10 tensors of size 16         |      29.5  |      12.0
100 tensors of size 16        |     256.9  |      19.9
1000 tensors of size 16       |    2499.7  |     123.6
10000 tensors of size 16      |   25022.2  |    1295.6
10 tensors of size 256        |      32.8  |      11.2
100 tensors of size 256       |     258.8  |      19.7
1000 tensors of size 256      |    2509.2  |     123.7
10000 tensors of size 256     |   25016.2  |    1295.4
10 tensors of size 65536      |      32.9  |      18.7
100 tensors of size 65536     |     327.1  |     150.3
1000 tensors of size 65536    |    3051.3  |    1388.0
10000 tensors of size 65536   |   30476.9  |   14021.5

[------------------ (tensors, tensors) ------------------]
                              |  for loop  |  foreach
10 tensors of size 4          |      26.8  |      17.3
100 tensors of size 4         |     206.8  |      90.5
1000 tensors of size 4        |    1993.0  |     828.9
10000 tensors of size 4       |   19851.0  |    9063.3
10 tensors of size 16         |      34.7  |      20.0
100 tensors of size 16        |     232.2  |     102.1
1000 tensors of size 16       |    2220.9  |     977.3
10000 tensors of size 16      |   22644.5  |   10361.4
10 tensors of size 256        |      30.5  |      19.7
100 tensors of size 256       |     231.6  |     102.4
1000 tensors of size 256      |    2251.9  |     978.7
10000 tensors of size 256     |   22680.3  |   10405.8
10 tensors of size 65536      |      30.6  |      34.4
100 tensors of size 65536     |     315.1  |     223.6
1000 tensors of size 65536    |    3252.1  |    2114.4
10000 tensors of size 65536   |   30578.0  |   22826.3
```

CPU:
```
[------------------- (tensors, scalar) -------------------]
                              |  for loop  |  foreach
10 tensors of size 4          |      13.0  |       9.6
100 tensors of size 4         |      62.4  |      31.6
1000 tensors of size 4        |     562.2  |     245.6
10000 tensors of size 4       |    5552.2  |    2517.7
10 tensors of size 16         |      14.9  |      11.3
100 tensors of size 16        |      74.1  |      36.9
1000 tensors of size 16       |     663.7  |     285.5
10000 tensors of size 16      |    6765.2  |    2947.5
10 tensors of size 256        |      15.2  |      11.8
100 tensors of size 256       |      76.0  |      37.7
1000 tensors of size 256      |     728.8  |     323.9
10000 tensors of size 256     |    7274.4  |    3800.3
10 tensors of size 65536      |     105.6  |     124.5
100 tensors of size 65536     |     982.8  |     939.7
1000 tensors of size 65536    |   14993.1  |   14579.2
10000 tensors of size 65536   |  163091.0  |  151555.8

[------------------- (tensors, tensors) ------------------]
                              |  for loop  |  foreach
10 tensors of size 4          |      11.8  |      10.5
100 tensors of size 4         |      53.1  |      38.2
1000 tensors of size 4        |     465.1  |     316.1
10000 tensors of size 4       |    4616.9  |    3625.9
10 tensors of size 16         |      13.5  |      12.3
100 tensors of size 16        |      63.0  |      46.5
1000 tensors of size 16       |     560.1  |     359.9
10000 tensors of size 16      |    5586.8  |    3765.9
10 tensors of size 256        |      15.2  |      13.7
100 tensors of size 256       |      64.4  |      48.3
1000 tensors of size 256      |     653.7  |     410.0
10000 tensors of size 256     |    5916.6  |    3901.3
10 tensors of size 65536      |     109.1  |     106.8
100 tensors of size 65536     |    1128.9  |    1105.0
1000 tensors of size 65536    |   16245.0  |   15950.8
10000 tensors of size 65536   |  171111.3  |  163540.2
```

Example use:
```
tensors = [torch.randn(16, device='cuda') for _ in range(10)]

out = torch._foreach_clamp_min(tensors, 0.1)
out = torch._foreach_clamp_min(tensors, [0.1] * len(tensors))
out = torch._foreach_clamp_min(tensors, tensors)

torch._foreach_clamp_min_(tensors, 0.1)
torch._foreach_clamp_min_(tensors, [0.1] * len(tensors))
torch._foreach_clamp_min_(tensors, tensors)
```

Complex types are not supported. The existing `foreach_minimum/maximum` ops are changed to use this new implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91384
Approved by: https://github.com/ngimel
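The commit message does not include the harness used to collect the numbers above. A minimal sketch of how comparable for-loop vs. foreach timings could be gathered with `torch.utils.benchmark` follows; the tensor counts, sizes, `min_run_time`, and the `for_loop_clamp_min_` helper are illustrative assumptions, not the author's actual script:

```
# Sketch of a for-loop vs. foreach comparison with torch.utils.benchmark.
# Counts/sizes mirror a subset of the table above; the harness is illustrative.
import torch
import torch.utils.benchmark as benchmark


def for_loop_clamp_min_(tensors, scalar):
    # Baseline: one clamp_min_ call (and kernel launch) per tensor.
    for t in tensors:
        t.clamp_min_(scalar)


results = []
for num_tensors in (10, 100):
    for numel in (4, 65536):
        tensors = [torch.randn(numel, device='cuda') for _ in range(num_tensors)]
        sub_label = f'{num_tensors} tensors of size {numel}'
        results.append(benchmark.Timer(
            stmt='for_loop_clamp_min_(tensors, 0.1)',
            globals={'for_loop_clamp_min_': for_loop_clamp_min_, 'tensors': tensors},
            label='(tensors, scalar)', sub_label=sub_label,
            description='for loop').blocked_autorange(min_run_time=1))
        results.append(benchmark.Timer(
            stmt='torch._foreach_clamp_min_(tensors, 0.1)',
            globals={'tensors': tensors},
            label='(tensors, scalar)', sub_label=sub_label,
            description='foreach').blocked_autorange(min_run_time=1))

benchmark.Compare(results).print()
```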
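Because `foreach_minimum/maximum` are rerouted through the same implementation, a quick sanity check (a sketch, assuming a PyTorch build that already contains #91384) is that `_foreach_clamp_min` and `_foreach_maximum` agree on tensorlist inputs:

```
import torch

tensors = [torch.randn(16) for _ in range(10)]
bounds = [torch.full((16,), 0.1) for _ in range(10)]

# Elementwise, clamp_min(a, b) == maximum(a, b); after this change both
# foreach entry points share one implementation, so the outputs should match.
clamped = torch._foreach_clamp_min(tensors, bounds)
maxed = torch._foreach_maximum(tensors, bounds)
assert all(torch.equal(x, y) for x, y in zip(clamped, maxed))
```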