Improve Normalization.cuh (#83871)
Remove unused Ops.
Replace copy-and-pasted reduction code with calls to BlockReduce (with SumReduceOp and 2D block indexing) and remove the duplicate warpSum.
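
For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of a block-wide sum that combines a reduction functor with 2D block indexing. It is an illustration only, not the code in Normalization.cuh: the names `SumOp`, `warp_reduce`, `block_reduce_2d` and `sum_kernel` are hypothetical, and the sketch assumes the block size is a multiple of the warp size (32) and at most 1024 threads.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kWarpSize = 32;

// Hypothetical reduction functor: says how two values are combined.
template <typename T>
struct SumOp {
  __device__ T combine(T a, T b) const { return a + b; }
};

// Reduce within one warp using shuffles; lane 0 ends up with the warp total.
template <typename T, typename Op>
__device__ T warp_reduce(T val, const Op& op) {
  for (int offset = kWarpSize / 2; offset > 0; offset >>= 1) {
    val = op.combine(val, __shfl_down_sync(0xffffffffu, val, offset));
  }
  return val;
}

// Reduce across a 2D thread block. `shared` must hold at least kWarpSize
// elements. The result is valid in the thread with linear index 0.
template <typename T, typename Op>
__device__ T block_reduce_2d(T val, const Op& op, T identity, T* shared) {
  // 2D block indexing: flatten (threadIdx.x, threadIdx.y) to a linear id.
  const int tid = threadIdx.y * blockDim.x + threadIdx.x;
  const int nthreads = blockDim.x * blockDim.y;
  const int lane = tid % kWarpSize;
  const int warp = tid / kWarpSize;

  val = warp_reduce(val, op);
  __syncthreads();  // protect `shared` in case it was used just before
  if (lane == 0) {
    shared[warp] = val;  // one partial result per warp
  }
  __syncthreads();

  // The first warp combines the per-warp partial results.
  const int nwarps = nthreads / kWarpSize;
  val = (tid < nwarps) ? shared[lane] : identity;
  if (warp == 0) {
    val = warp_reduce(val, op);
  }
  return val;
}

// Example kernel: a single block sums n floats.
__global__ void sum_kernel(const float* in, int n, float* out) {
  __shared__ float shared[kWarpSize];
  const int tid = threadIdx.y * blockDim.x + threadIdx.x;
  const int nthreads = blockDim.x * blockDim.y;

  float partial = 0.0f;
  for (int i = tid; i < n; i += nthreads) {
    partial += in[i];  // per-thread strided partial sum
  }
  const float total = block_reduce_2d(partial, SumOp<float>{}, 0.0f, shared);
  if (tid == 0) {
    *out = total;
  }
}

int main() {
  const int n = 1000;
  float* in = nullptr;
  float* out = nullptr;
  cudaMallocManaged(&in, n * sizeof(float));
  cudaMallocManaged(&out, sizeof(float));
  for (int i = 0; i < n; ++i) {
    in[i] = 1.0f;
  }
  sum_kernel<<<1, dim3(32, 4)>>>(in, n, out);  // 2D block of 128 threads
  cudaDeviceSynchronize();
  printf("block sum = %f\n", *out);
  cudaFree(in);
  cudaFree(out);
  return 0;
}
```

Keeping the combine step in a functor means the same block-level helper can serve several kernels, which is the kind of deduplication this change is about.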
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83871
Approved by: https://github.com/ngimel