[BuildSpeed] Limit `Logcumsumexp` complex to OSS builds only (#98957)
Building with complex types takes a ridiculous amount of time on CUDA-11.4.
Build times for a single GPU architecture (`sm_80`) on a 3 GHz Intel Xeon 8275CL:
- 143 sec to compile for all dtypes using CUDA-11.6
- 351 sec to compile for all dtypes using CUDA-11.4
- 24 sec to compile for only floating dtypes using CUDA-11.6
- 52 sec to compile for only floating dtypes using CUDA-11.4
Also tweak the code slightly so that it compiles with MSVC, which has trouble with nested preprocessor directives.
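
A minimal, self-contained sketch of the pattern, not the actual kernel code. It assumes that the MSVC problem comes from `#if`/`#else` appearing inside a macro's argument list, and that the fix is to pick the type list once, up front, under a separate name. Every identifier here (`EXAMPLE_INTERNAL_BUILD`, `EXAMPLE_DISPATCH_TYPES`, `instantiate_kernel`, `instantiated_dtypes`) is hypothetical and only stands in for the real build flag and dispatch macros:

```cpp
#include <complex>
#include <string>
#include <vector>

// Hypothetical flag marking a non-OSS build; leave undefined for OSS.
// #define EXAMPLE_INTERNAL_BUILD

// X-macro lists of the dtypes a kernel could be instantiated for.
#define EXAMPLE_FLOATING_TYPES(_) \
  _(float, "float")               \
  _(double, "double")

#define EXAMPLE_COMPLEX_TYPES(_)             \
  _(std::complex<float>, "complex<float>")   \
  _(std::complex<double>, "complex<double>")

// Choose the list once, at file scope. No #if/#else ends up inside the
// arguments of another macro invocation.
#if defined(EXAMPLE_INTERNAL_BUILD)
#define EXAMPLE_DISPATCH_TYPES(_) EXAMPLE_FLOATING_TYPES(_)
#else
#define EXAMPLE_DISPATCH_TYPES(_) \
  EXAMPLE_FLOATING_TYPES(_)       \
  EXAMPLE_COMPLEX_TYPES(_)
#endif

// Stand-in for an expensive templated kernel: each instantiation costs
// compile time, so dropping the complex types shortens the build.
template <typename T>
void instantiate_kernel(std::vector<std::string>& names, const char* name) {
  names.push_back(name);
}

// Instantiates the stand-in kernel for every selected dtype and returns
// the names of the dtypes that were compiled in.
std::vector<std::string> instantiated_dtypes() {
  std::vector<std::string> names;
#define EXAMPLE_INSTANTIATE(T, NAME) instantiate_kernel<T>(names, NAME);
  EXAMPLE_DISPATCH_TYPES(EXAMPLE_INSTANTIATE)
#undef EXAMPLE_INSTANTIATE
  return names;
}
```

With `EXAMPLE_INTERNAL_BUILD` undefined, `instantiated_dtypes()` covers both floating and complex types; defining it restricts the list to `float` and `double`.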
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98957
Approved by: https://github.com/r-barnes, https://github.com/ngimel