more aggressive mix order reduction (#166382)
Summary:
Make mix order reduction more aggressive so that the fused kernel can still be generated when rnumel is larger than 1024. Also use more warps in that case.
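
For readers unfamiliar with the pattern, here is a minimal pure-Python sketch of what fusing reductions of different orders means: one pass over a 2D input that accumulates both row sums and column sums. It is an illustration only and not the Inductor implementation. The function name `fused_mixed_order_sums` is hypothetical, and treating the inner dimension length as the analogue of rnumel is an assumption made for the example.

```python
# Illustrative sketch only; this is NOT Inductor code.
# The function name and data layout are hypothetical.

def fused_mixed_order_sums(x):
    """Return (row_sums, col_sums) for a 2D list, reading each element once."""
    n_rows = len(x)
    n_cols = len(x[0]) if n_rows else 0  # assumed analogue of rnumel for the row sums
    row_sums = [0.0] * n_rows
    col_sums = [0.0] * n_cols
    for i, row in enumerate(x):
        for j, value in enumerate(row):
            row_sums[i] += value  # reduction along the inner dimension
            col_sums[j] += value  # reduction along the outer dimension
    return row_sums, col_sums


if __name__ == "__main__":
    rows, cols = fused_mixed_order_sums([[1, 2, 3], [4, 5, 6]])
    print(rows, cols)
```

The sketch shows why fusing is attractive: both reductions share a single read of the input instead of two separate passes.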
X-link: https://github.com/pytorch/pytorch/pull/166382
Approved by: https://github.com/jansel, https://github.com/v0i0
ghstack dependencies: #166053
Reviewed By: donigian
Differential Revision: D86056478
fbshipit-source-id: 6a1561fb54d450b69b08d41c1836c0361577e8f4