Fuse the row-wise sharded linear matmuls into a single matmul to improve perf.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78449
Instead of looping over the shards and performing a separate matmul for each,
we can perform one fused matmul, so only a single CUDA kernel is launched for
this operation.
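As an illustrative sketch of the fusion (not the actual PR code, which operates on torch ShardedTensor shards; NumPy stands in for torch here, and the shard shapes are hypothetical), the looped per-shard matmuls and the single fused matmul produce the same result:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))  # input batch

# Two local weight shards (illustrative shapes only)
shards = [rng.standard_normal((8, 3)) for _ in range(2)]

# Before: one matmul (one kernel launch) per shard, results concatenated
looped = np.concatenate([x @ w for w in shards], axis=1)

# After: concatenate the shards once, then a single fused matmul
fused = x @ np.concatenate(shards, axis=1)

assert np.allclose(looped, fused)
```

The fused form trades a one-time concatenation for eliminating a kernel launch per shard, which is where the speedup comes from.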
Differential Revision: [D36743354](https://our.internmc.facebook.com/intern/diff/D36743354/)
Approved by: https://github.com/aazzolini, https://github.com/wanchaol