fix grain size setting for baddbmm_cpu_kernel (#98297)
Fixes https://github.com/pytorch/pytorch/issues/92892
The `grain_size` setting for parallelization in `baddbmm_cpu_kernel` is wrong: it causes small inputs to be run in parallel, which incurs unnecessary OpenMP threading overhead.
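
For context, the following is a minimal sketch of how a batch-level grain size can gate parallelism. It is not the kernel's actual code: `kTargetWorkPerChunk`, `batch_grain_size`, and `should_parallelize` are hypothetical names, the value of the work target is an assumption made for illustration, and the rule "go parallel only when the range exceeds the grain size" is a simplification of what a parallel-for helper typically does.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical target amount of work per parallel chunk
// (an assumption for illustration, not taken from the kernel).
constexpr int64_t kTargetWorkPerChunk = 32768;

// Number of batches a single chunk should cover, given that one batch
// performs roughly is * js * ks multiply-adds.
// Clamping with max(..., 1) keeps the grain size at least 1, so large
// per-batch work yields a grain of 1 and small per-batch work yields a
// large grain.
int64_t batch_grain_size(int64_t is, int64_t js, int64_t ks) {
  const int64_t work_per_batch = std::max<int64_t>(is * js * ks, 1);
  return std::max<int64_t>(kTargetWorkPerChunk / work_per_batch, 1);
}

// Simplified decision rule: split the batch range across threads only
// when it is longer than the grain size.
bool should_parallelize(int64_t num_batches, int64_t grain_size) {
  return num_batches > grain_size;
}
```

Under these assumptions, 8 batches of 4x4x4 work give a grain size of 512 and stay serial, while 8 batches of 256x256x256 work give a grain size of 1 and are split across threads.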
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98297
Approved by: https://github.com/lezcano, https://github.com/nikitaved