Add trivial reduce for CUDA (#36092)
Summary:
Detect loads from buffers that are not read-only, and do not use __ldg for them.
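A minimal sketch of the idea, not the code in this PR: the `Access` struct and the helpers `collect_written_buffers` and `emit_load` are hypothetical names invented for illustration. It assumes a two-pass scheme in which the code generator first collects every buffer the kernel stores to, then emits `__ldg` only for loads from buffers that are never written (for example, an accumulator that is both read and written would fall back to a plain load).

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical toy IR node: one memory access in a kernel body.
struct Access {
  std::string buffer;  // name of the buffer being accessed
  std::string index;   // index expression, already rendered as text
  bool is_store;       // true for a store, false for a load
};

// Pass 1: collect the names of all buffers the kernel writes to.
std::set<std::string> collect_written_buffers(
    const std::vector<Access>& accesses) {
  std::set<std::string> written;
  for (const Access& a : accesses) {
    if (a.is_store) {
      written.insert(a.buffer);
    }
  }
  return written;
}

// Pass 2: render a load. __ldg goes through the read-only data cache,
// so it is emitted only when the buffer is never stored to in the kernel.
std::string emit_load(const Access& a,
                      const std::set<std::string>& written) {
  if (written.count(a.buffer) == 0) {
    return "__ldg(" + a.buffer + " + " + a.index + ")";
  }
  // The buffer is also written: use an ordinary load.
  return a.buffer + "[" + a.index + "]";
}
```

With a kernel that loads `a[i]`, loads `out[0]`, and stores to `out[0]`, this sketch would render the first load as `__ldg(a + i)` and the second as `out[0]`.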
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36092
Reviewed By: ZolotukhinM
Differential Revision: D20876204
Pulled By: zheng-xq
fbshipit-source-id: a719f3583cc4ca30fcfb49d999ca785181354d84