Add one-block multi-thread global reduction support. (#36306)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36306
Known gap: __syncthreads between sections is still missing.
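
For context, a minimal sketch of why a barrier between sections matters in a one-block, multi-thread reduction into global memory. This is not the code produced by this change. The kernel name `one_block_sum`, the host helper `run_one_block_sum`, the block size of 128, and the reading of "sections" as an initialization phase followed by an accumulation phase are all assumptions made for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: sums `n` floats into `*out` using a single block.
__global__ void one_block_sum(const float* in, float* out, int n) {
  // Section 1 (assumed): one thread initializes the global accumulator.
  if (threadIdx.x == 0) {
    *out = 0.0f;
  }

  // Barrier between sections: without it, another thread could reach the
  // atomicAdd below before the store above, and its contribution would be
  // overwritten by the late initialization.
  __syncthreads();

  // Section 2 (assumed): each thread reduces a strided slice locally,
  // then folds its partial sum into the global accumulator.
  float partial = 0.0f;
  for (int i = threadIdx.x; i < n; i += blockDim.x) {
    partial += in[i];
  }
  atomicAdd(out, partial);
}

// Hypothetical host helper; error checking is omitted for brevity.
float run_one_block_sum(const std::vector<float>& host_in) {
  const int n = static_cast<int>(host_in.size());
  float* d_in = nullptr;
  float* d_out = nullptr;
  cudaMalloc(&d_in, n * sizeof(float));
  cudaMalloc(&d_out, sizeof(float));
  cudaMemcpy(d_in, host_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

  // Exactly one block, so __syncthreads covers every participating thread.
  one_block_sum<<<1, 128>>>(d_in, d_out, n);

  float result = 0.0f;
  cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d_in);
  cudaFree(d_out);
  return result;
}
```

The sketch relies on the launch using a single block: __syncthreads only synchronizes threads within one block, so the same pattern would not be safe across multiple blocks.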
Differential Revision: D20957254
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Pulled By: zheng-xq
fbshipit-source-id: c988f0205b667174b3ee851c28adeec2dbd089f7