3868eeb7 - fix biasadd OMP perf issue for the packed MKL SGEMM (#92300)

fix biasadd OMP perf issue for the packed MKL SGEMM (#92300)

Currently the biasadd of the packed MKL SGEMM is executed via an OpenMP macro, which causes a performance issue when the SGEMM size is very small (e.g., M = 1, K = 80, N = 256) and many threads are in use. In that case `num_task < num_thread` and each task is tiny (e.g., ~1-2 cycles for the memcpy), so the thread synchronization cost dominates. It is better to use `at::parallel_for`, which runs such small workloads directly on the main thread.

| Packed MKL SGEMM (1x80x256) | OpenMP biasadd | `at::parallel_for` biasadd |
| -- | -- | -- |
| Latency | 2000 us | 21 us |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92300
Approved by: https://github.com/XiaobingSuper, https://github.com/jgong5