Fix CUDA kernels when building with clang and libc++ (#25553)
Summary:
We found a bug involving `std::tuple` under nvcc.
In C++11, the `std::tuple` constructor is constexpr in libstdc++ but not in libc++.
https://github.com/pytorch/pytorch/blob/c36b77fcdad3d54227cf0fd51693eb57035002c0/aten/src/ATen/native/cuda/Loops.cuh#L109-L111
These lines cause crashes in CUDA with the message `scan failed with synchronize`, which is an error message from CUDA initialization.
This PR fixes the loop code for nvcc with libc++ by not using `std::tuple`.
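As an illustration of the general idea only (not the actual change in this PR), a plain aggregate can stand in for `std::tuple` so that device code does not rely on a standard-library constructor being constexpr. Every name below (`HOST_DEVICE`, `ArgPair`, `load_args`, `apply_add`, `demo_sum_at`) is hypothetical and does not come from `Loops.cuh`.

```cpp
#include <cassert>
#include <cstddef>

// Expands to CUDA qualifiers under nvcc, and to nothing for a host-only build.
#if defined(__CUDACC__)
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical stand-in for a std::tuple<float, float>: a plain aggregate
// with no user-declared constructor, so building it does not depend on any
// standard-library constructor being constexpr.
struct ArgPair {
  float first;
  float second;
};

// Hypothetical loader: reads the two operands at index i.
HOST_DEVICE inline ArgPair load_args(const float* a, const float* b,
                                     std::size_t i) {
  ArgPair args = {a[i], b[i]};
  return args;
}

// Hypothetical elementwise op applied to the loaded arguments.
HOST_DEVICE inline float apply_add(const ArgPair& args) {
  return args.first + args.second;
}

// Host-side check of the same per-element code path a kernel would run.
inline float demo_sum_at(std::size_t i) {
  const float a[3] = {1.0f, 2.0f, 3.0f};
  const float b[3] = {10.0f, 20.0f, 30.0f};
  assert(i < 3);
  return apply_add(load_args(a, b, i));
}
```

Because `ArgPair` is initialized by aggregate initialization, the same source should behave identically whichever standard library the host compiler links against.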
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25553
Differential Revision: D17582118
Pulled By: yf225
fbshipit-source-id: d6f62ed46c2415b48eb49f8a051cf3c0e7cb23ce