Replace nullary/unary/binary loops with generic implementation (#21475)

5 years ago

Replace nullary/unary/binary loops with generic implementation (#21475)

Summary:
```
This replaces the kernel helpers in Loops.h/cuh with the following:

  cpu_kernel
  cpu_kernel_vec

  gpu_kernel
  gpu_kernel_with_scalars

These work with functions with any number of input arguments, with the
exception of 'gpu_kernel_with_scalars', which is limited to binary
operations. Previously, we only supported functions of 0, 1, or 2 input
arguments. Adding support for 3 or 4 input argument functions required
a significant amount of additional code.

This makes a few other changes:

* Remove 'ntensors' from the for_each/serial_for_each loop. Most loops
  assume a fixed number of tensors, and the value is accessible from
  TensorIterator::ntensors().

* Only lift CPU scalars to parameters in 'gpu_kernel_with_scalars'.
  Previously, we performed this recursively in gpu_unary_kernel and
  gpu_binary_kernel, so something like `torch.add(3, 4, out=cuda_tensor)`
  would specialize to a "nullary" kernel. Now, only the first scalar
  input is lifted to a kernel parameter. Any additional scalar inputs
  are copied to CUDA tensors. Note that operations like `x + 5` and
  `5 + x` still work efficiently. This avoids generating a number of
  specializations that is exponential in the number of input arguments.
```

**Performance measurements**

Timing numbers are unchanged for basic elementwise operations. Linked below is a script to measure torch.add perf on PR vs. master, CPU+GPU (GCC 7.3):

[miniperf.py](https://gist.github.com/colesbury/4a61893a22809cb0931f08cd37127be4)

**Generated assembly**

cpu_kernel and cpu_kernel_vec still generate good vectorized code with both GCC 7.3 and GCC 4.8.5. Below is the assembly for the "hot" inner loop of torch.add as well as an auto-vectorized torch.mul implementation using cpu_kernel/binary_kernel. (The real torch.mul uses cpu_kernel_vec, but I wanted to check that auto-vectorization still works well):

[torch.add GCC 7.3](https://gist.github.com/colesbury/927ddbc71dc46899602589e85aef1331)
[torch.add GCC 4.8](https://gist.github.com/colesbury/f00e0aafd3d1c54e874e9718253dae16)
[torch.mul auto vectorized GCC 7.3](https://gist.github.com/colesbury/3077bfc65db9b4be4532c447bc0f8628)
[torch.mul auto vectorized GCC 4.8](https://gist.github.com/colesbury/1b38e158b3f0aaf8aad3a76963fcde86)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/21475
Differential Revision: D15745116
Pulled By: colesbury
fbshipit-source-id: 914277d7930dc16e94f15bf87484a4ef82890f91
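The two ideas in the summary, one loop helper for any number of inputs and lifting a single scalar into the kernel, can be illustrated with a small standalone sketch. This is not the code from Loops.h/cuh: the names `elementwise_sketch`, `binary_with_scalar_sketch`, and the two `demo_*` functions are hypothetical, and the sketch assumes contiguous raw arrays of equal length instead of a TensorIterator (no strides, dtype dispatch, or explicit vectorization).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an arity-generic kernel helper: a single template
// covers nullary, unary, binary, ternary, ... operations by expanding a
// parameter pack of input pointers. Assumes contiguous arrays of length n.
template <typename Out, typename F, typename... In>
void elementwise_sketch(std::size_t n, Out* out, F op, const In*... in) {
  assert(n == 0 || out != nullptr);
  for (std::size_t i = 0; i < n; ++i) {
    // Expands to op(in0[i], in1[i], ...), or op() when there are no inputs.
    out[i] = op(in[i]...);
  }
}

// Hypothetical illustration of lifting one scalar into the kernel: the first
// operand of a binary op is captured by value, so the loop that actually runs
// is unary. Only one scalar is bound, which keeps the number of
// specializations from growing with every scalar/tensor combination.
template <typename Out, typename F, typename A, typename B>
void binary_with_scalar_sketch(std::size_t n, Out* out, F op, A scalar,
                               const B* b) {
  elementwise_sketch(n, out, [=](B x) { return op(scalar, x); }, b);
}

// Ternary usage: out[i] = a[i] * b[i] + c[i], with no ternary-specific helper.
std::vector<float> demo_ternary() {
  const std::vector<float> a{1.0f, 2.0f, 3.0f};
  const std::vector<float> b{4.0f, 5.0f, 6.0f};
  const std::vector<float> c{7.0f, 8.0f, 9.0f};
  std::vector<float> out(a.size());
  elementwise_sketch(
      a.size(), out.data(),
      [](float x, float y, float z) { return x * y + z; },
      a.data(), b.data(), c.data());
  return out;
}

// Scalar usage: out[i] = 5 + x[i], with the scalar bound into the closure.
std::vector<float> demo_scalar_plus_tensor() {
  const std::vector<float> x{1.0f, 2.0f, 3.0f};
  std::vector<float> out(x.size());
  binary_with_scalar_sketch(
      x.size(), out.data(),
      [](float l, float r) { return l + r; },
      5.0f, x.data());
  return out;
}
```

The pack expansion is what removes the need for separate nullary, unary, and binary loop helpers. The scalar wrapper shows why binding only one scalar is enough for `x + 5` and `5 + x` style calls to stay on a single-input loop.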

Author: colesbury
Committer: facebook-github-bot
Parents: 7f057f00

pytorch d8314a62 - Replace nullary/unary/binary loops with generic implementation (#21475)
