[PyTorch] Thread-parallel bmm across batch dim (#59596)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59596
Parallelize batch matmul (bmm) across the batch dimension. This was found to
improve performance for some use cases on mobile.
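
For illustration only, a minimal standalone sketch of splitting a batched
matmul across the batch dimension. It is not the code in this change: it uses
plain std::thread rather than PyTorch's threading utilities, and the names
matmul_single, bmm_parallel and num_threads are hypothetical. Each batch entry
writes to a disjoint slice of the output, so the workers need no locking.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: C = A * B for one row-major (m x k) by (k x n) pair.
static void matmul_single(const float* a, const float* b, float* c,
                          std::size_t m, std::size_t k, std::size_t n) {
  for (std::size_t i = 0; i < m; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      float acc = 0.0f;
      for (std::size_t p = 0; p < k; ++p) {
        acc += a[i * k + p] * b[p * n + j];
      }
      c[i * n + j] = acc;
    }
  }
}

// Hypothetical sketch: a is [batch, m, k], b is [batch, k, n], both
// contiguous and row-major. Returns a [batch, m, n] result.
std::vector<float> bmm_parallel(const std::vector<float>& a,
                                const std::vector<float>& b,
                                std::size_t batch, std::size_t m,
                                std::size_t k, std::size_t n,
                                std::size_t num_threads) {
  assert(a.size() == batch * m * k);
  assert(b.size() == batch * k * n);
  std::vector<float> c(batch * m * n, 0.0f);
  if (batch == 0) {
    return c;
  }

  // Never use more workers than batch entries, and always at least one.
  num_threads = std::max<std::size_t>(1, std::min(num_threads, batch));
  const std::size_t chunk = (batch + num_threads - 1) / num_threads;

  std::vector<std::thread> workers;
  for (std::size_t t = 0; t < num_threads; ++t) {
    const std::size_t begin = t * chunk;
    const std::size_t end = std::min(batch, begin + chunk);
    if (begin >= end) {
      break;
    }
    // Each worker handles the contiguous batch range [begin, end).
    workers.emplace_back([&a, &b, &c, m, k, n, begin, end] {
      for (std::size_t bi = begin; bi < end; ++bi) {
        matmul_single(a.data() + bi * m * k, b.data() + bi * k * n,
                      c.data() + bi * m * n, m, k, n);
      }
    });
  }
  for (auto& w : workers) {
    w.join();
  }
  return c;
}
```

Splitting on the batch index keeps each per-matrix multiply untouched and only
pays off when the batch is large enough to amortize thread startup; a real
implementation would likely reuse a thread pool instead of spawning threads
per call.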
ghstack-source-id: 130989569
Test Plan: CI unit tests
Reviewed By: albanD
Differential Revision: D26833417
fbshipit-source-id: 9b84d89d29883a6c9d992d993844dd31a25f76b1