Add trivial reduce for CUDA (#36293)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36293
Detect loads from buffers that are not read-only, and do not use __ldg for them.
Resubmitting #36092.
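For context, __ldg reads through the GPU's read-only data cache, so it is only safe for buffers that the kernel never writes. The following is a minimal sketch of that check, not the code in this PR: the Access struct and the findWrittenBuffers and emitLoad helpers are hypothetical names made up for illustration, and the real code generator works on its own IR rather than on strings.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical, simplified record of one memory access in a kernel body.
struct Access {
  std::string buffer;  // name of the buffer being accessed
  bool is_store;       // true for a store, false for a load
};

// Collect every buffer that is written at least once in the kernel.
std::unordered_set<std::string> findWrittenBuffers(
    const std::vector<Access>& accesses) {
  std::unordered_set<std::string> written;
  for (const Access& a : accesses) {
    if (a.is_store) {
      written.insert(a.buffer);
    }
  }
  return written;
}

// Emit the source text for a load. __ldg is used only when the buffer is
// never written in the kernel; otherwise a plain indexed load is emitted.
std::string emitLoad(const std::string& buffer, const std::string& index,
                     const std::unordered_set<std::string>& written) {
  if (written.count(buffer) == 0) {
    return "__ldg(" + buffer + " + " + index + ")";
  }
  return buffer + "[" + index + "]";
}
```

In this sketch, a kernel that loads "a" and both stores to and loads from "out" would get __ldg(a + i) for the first buffer and a plain out[i] for the second.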
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D20935933
Pulled By: zheng-xq
fbshipit-source-id: f9280db26aa9c9c8119cea12571bc820f5fbcb61