simplify cpu_kernel to not have contiguous special case (#58830)
Summary:
Per title
`unroll_contiguous_scalar_checks` checks whether all arguments (including outputs) are contiguous, except possibly one scalar argument (stride 0). It then calls the passed lambda with the index of that scalar argument if the check succeeded, or with 0 if the arguments were not contiguous or there was no scalar. Depending on this index (0 = not found), the lambda can call a different function: in vectorized kernels it calls the vectorized loop when the arguments are contiguous plus one scalar, and the basic loop otherwise. That makes sense for the vectorized kernel, because the vectorized loop can still be used in some broadcast cases. The other callers (`cpu_kernel`, `serial_cpu_kernel`, `cpu_kernel_multiple_outputs`), however, don't use the `idx` argument in the lambda at all, so they do the same thing regardless of what `unroll_contiguous_scalar_checks` finds. There is no point in calling `unroll_contiguous_scalar_checks` from them.
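As a rough illustration of this argument, here is a toy C++ model. It is not the ATen code. `find_contiguous_scalar_arg`, `dispatch_on_scalar_arg`, `basic_loop`, `scalar_fast_loop` and the `*_style` kernels are hypothetical stand-ins. Operands are `float` arrays with element strides rather than byte strides, and slot 0 is the output. The point is that when the callback ignores the index, routing through the check cannot change the result.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Toy model of a binary element-wise op out[i] = a[i] + b[i].
// Slot 0 is the output, slots 1..2 are the inputs (hypothetical layout).
constexpr std::size_t kNumArgs = 3;
using Strides = std::array<std::int64_t, kNumArgs>;

// Returns the 1-based index of the single stride-0 (scalar) input when every
// other operand, output included, has stride 1; returns 0 otherwise.
std::size_t find_contiguous_scalar_arg(const Strides& s) {
  for (std::size_t idx = 1; idx < kNumArgs; ++idx) {
    bool ok = (s[idx] == 0);
    for (std::size_t j = 0; j < kNumArgs && ok; ++j) {
      if (j != idx && s[j] != 1) ok = false;
    }
    if (ok) return idx;
  }
  return 0;
}

// Calls cb with the index found above (0 = not found).
template <typename Callback>
void dispatch_on_scalar_arg(const Strides& s, Callback&& cb) {
  cb(find_contiguous_scalar_arg(s));
}

// Generic strided loop: correct for any strides.
void basic_loop(float* out, const float* a, const float* b,
                const Strides& s, std::int64_t n) {
  for (std::int64_t i = 0; i < n; ++i) {
    out[i * s[0]] = a[i * s[1]] + b[i * s[2]];
  }
}

// Stand-in for a vectorized loop: only valid for contiguous operands plus one
// scalar input, which it reads once outside the loop.
void scalar_fast_loop(float* out, const float* a, const float* b,
                      std::size_t scalar_idx, std::int64_t n) {
  if (scalar_idx == 1) {
    const float sa = a[0];
    for (std::int64_t i = 0; i < n; ++i) out[i] = sa + b[i];
  } else {
    const float sb = b[0];
    for (std::int64_t i = 0; i < n; ++i) out[i] = a[i] + sb;
  }
}

// Vectorized-style kernel: the index actually selects the loop.
void vectorized_style_kernel(float* out, const float* a, const float* b,
                             const Strides& s, std::int64_t n) {
  dispatch_on_scalar_arg(s, [&](std::size_t idx) {
    if (idx != 0) {
      scalar_fast_loop(out, a, b, idx, n);
    } else {
      basic_loop(out, a, b, s, n);
    }
  });
}

// cpu_kernel-style before the change: the check runs, but the callback
// ignores the index, so every outcome ends in basic_loop.
void cpu_kernel_style_before(float* out, const float* a, const float* b,
                             const Strides& s, std::int64_t n) {
  dispatch_on_scalar_arg(s, [&](std::size_t /*idx*/) {
    basic_loop(out, a, b, s, n);
  });
}

// cpu_kernel-style after the change: same behavior without the check.
void cpu_kernel_style_after(float* out, const float* a, const float* b,
                            const Strides& s, std::int64_t n) {
  basic_loop(out, a, b, s, n);
}
```

Only `vectorized_style_kernel` branches on the returned index. For the two `cpu_kernel_style_*` functions, the check is pure overhead, which mirrors why the real callers can drop `unroll_contiguous_scalar_checks`.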
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58830
Reviewed By: zou3519, mruberry
Differential Revision: D28632668
Pulled By: ngimel
fbshipit-source-id: c6db3675933184e17cc249351c4f170b45d28865
Author: Natalia Gimelshein