Fuse the matmuls in row-wise sharded linear into a single matmul.
Performing a single large matmul is more efficient than performing
multiple smaller matmuls in a loop.
This is a similar improvement to https://github.com/pytorch/pytorch/pull/78449
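A minimal sketch of the idea, not the code in this PR: it assumes the loop multiplies several input chunks by the same local weight shard, and it uses plain Python lists in place of tensors. The helper names (`matmul`, `looped`, `fused`) are hypothetical and only illustrate that concatenating the chunks, running one matmul, and splitting the result gives the same outputs as the loop.

```python
# Illustrative only: plain Python lists stand in for tensors, and all
# function names here are hypothetical, not taken from the PyTorch code.

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def looped(chunks, weight):
    """One matmul per input chunk: the pattern being replaced."""
    return [matmul(chunk, weight) for chunk in chunks]


def fused(chunks, weight):
    """Concatenate the chunks along rows, run one matmul, split the result."""
    stacked = [row for chunk in chunks for row in chunk]
    out = matmul(stacked, weight)
    results, start = [], 0
    for chunk in chunks:
        # Slice out the rows that belong to this chunk.
        results.append(out[start:start + len(chunk)])
        start += len(chunk)
    return results


chunks = [[[1, 2], [3, 4]], [[5, 6]]]   # two input chunks (assumed shapes)
weight = [[1, 0, 2], [0, 1, 3]]         # shared local weight shard (assumed)
assert looped(chunks, weight) == fused(chunks, weight)
```

Each output row depends only on its own input row, so stacking the chunks does not change any result; the benefit is a single large kernel call instead of several small ones.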
Differential Revision: [D36828505](https://our.internmc.facebook.com/intern/diff/D36828505/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78672
Approved by: https://github.com/fduwjj, https://github.com/wanchaol