[C2] Optimize MulGradient Operator when inner_size is 1 (#36767)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36767
Add a simpler implementation of the MulGradient CUDA kernel for the inner_size == 1 case, which eliminates the inner loop.
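
A minimal CPU reference sketch of the idea, in C++ rather than CUDA so it can be checked on the host. It is not the kernel from this change: the function names MulGradientGeneral and MulGradientInnerOne are hypothetical, and the sketch assumes the gradient for each output element is a sum of dC * B products over inner_size contiguous elements. Under that assumption, inner_size == 1 reduces each output to a single product, so no inner loop is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical general path: each output element i accumulates
// inner_size products of the upstream gradient dC and the other operand B.
std::vector<float> MulGradientGeneral(
    const std::vector<float>& dC,
    const std::vector<float>& B,
    std::size_t outer_size,
    std::size_t inner_size) {
  assert(dC.size() == outer_size * inner_size);
  assert(B.size() == outer_size * inner_size);
  std::vector<float> dA(outer_size, 0.0f);
  for (std::size_t i = 0; i < outer_size; ++i) {
    float sum = 0.0f;
    for (std::size_t j = 0; j < inner_size; ++j) {
      const std::size_t idx = i * inner_size + j;
      sum += dC[idx] * B[idx];
    }
    dA[i] = sum;
  }
  return dA;
}

// Hypothetical specialization for inner_size == 1: the reduction has a
// single term, so each output is one product and the inner loop disappears.
std::vector<float> MulGradientInnerOne(
    const std::vector<float>& dC,
    const std::vector<float>& B,
    std::size_t outer_size) {
  assert(dC.size() == outer_size);
  assert(B.size() == outer_size);
  std::vector<float> dA(outer_size);
  for (std::size_t i = 0; i < outer_size; ++i) {
    dA[i] = dC[i] * B[i];
  }
  return dA;
}
```

In a CUDA version of this sketch, the loop over i would presumably map to one thread per output element, with no per-output reduction required.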
Reviewed By: xw285cornell
Differential Revision: D21013269
fbshipit-source-id: bb62470d91a7fef6eecc3d4766a2c994ca6bb2c8