fix biasadd OMP perf issue for the packed MKL SGEMM (#92300)
Currently, the biasadd of the packed MKL SGEMM is executed using an OpenMP macro. This leads to a performance issue when the SGEMM size is very small (e.g., M = 1, K = 80, N = 256) and many threads are in use.
The reason is that in such cases `num_task < num_thread` and each task is very cheap (e.g., ~1-2 cycles for the memcpy), so the thread synchronization cost dominates. It is therefore better to use `at::parallel_for`, which runs such small workloads directly on the main thread.
Packed MKL SGEMM (1x80x256) | OpenMP biasadd | `at::parallel_for` biasadd
-- | -- | --
Latency | 2000 us | 21 us
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92300
Approved by: https://github.com/XiaobingSuper, https://github.com/jgong5