llvm-project
bb369f1c - [libc][x86] Add Non-temporal code path for large memcpy (#187108)

Commit
28 days ago
[libc][x86] Add Non-temporal code path for large memcpy (#187108) Large memcopies are pretty rare, but are more common in ML workloads (copying large matrixes/tensors, often to/from CPU host). For large copies NTA stores can provide performance advantages for both memcpy itself and the rest of the workload (by reducing cache pollution). Other runtimes already have NTA path for large copies, so add 1 to the llvm-libc. Internal whole-program loadtests shows small, but statistically significant improvement of 0.1%. ML specific bencahmrks showed 10-20% performance gain, and fleetbench (https://github.com/google/fleetbench, which has more up-to-date version of libc benchmarks) shows ~3% gain (ns/byte for distributions taken from various applications). ``` [Memcpy_0]_L1 0.01950n ± 3% 0.01900n ± 5% ~ (p=0.390 n=20) [Memcpy_0]_L2 0.02300n ± 0% 0.02300n ± 0% ~ (p=0.256 n=20) [Memcpy_0]_LLC 0.1335n ± 1% 0.1310n ± 1% -1.87% (p=0.000 n=20) [Memcpy_0]_Cold 0.1540n ± 2% 0.1520n ± 1% -1.30% (p=0.021 n=20) [Memcpy_1]_L1 0.04300n ± 5% 0.04200n ± 2% -2.33% (p=0.000 n=20) [Memcpy_1]_L2 0.05000n ± 2% 0.04800n ± 0% -4.00% (p=0.000 n=20) [Memcpy_1]_LLC 0.2500n ± 2% 0.2390n ± 1% -4.40% (p=0.000 n=20) [Memcpy_1]_Cold 0.2750n ± 1% 0.2640n ± 1% -4.00% (p=0.000 n=20) [Memcpy_2]_L1 0.03800n ± 3% 0.03800n ± 3% ~ (p=0.420 n=20) [Memcpy_2]_L2 0.04400n ± 2% 0.04300n ± 0% -2.27% (p=0.000 n=20) [Memcpy_2]_LLC 0.2320n ± 1% 0.2220n ± 1% -4.31% (p=0.000 n=20) [Memcpy_2]_Cold 0.2565n ± 1% 0.2460n ± 1% -4.09% (p=0.000 n=20) [Memcpy_3]_L1 0.1380n ± 1% 0.1355n ± 2% ~ (p=0.095 n=20) [Memcpy_3]_L2 0.1490n ± 1% 0.1430n ± 1% -4.03% (p=0.000 n=20) [Memcpy_3]_LLC 0.7955n ± 1% 0.7450n ± 0% -6.35% (p=0.000 n=20) [Memcpy_3]_Cold 0.8495n ± 1% 0.7935n ± 0% -6.59% (p=0.000 n=20) [Memcpy_4]_L1 0.04000n ± 3% 0.03900n ± 3% ~ (p=0.466 n=20) [Memcpy_4]_L2 0.04500n ± 2% 0.04400n ± 2% ~ (p=0.130 n=20) [Memcpy_4]_LLC 0.2040n ± 1% 0.1950n ± 1% -4.41% (p=0.000 n=20) [Memcpy_4]_Cold 0.2240n ± 1% 0.2150n ± 1% -4.02% (p=0.000 n=20) [Memcpy_5]_L1 0.05800n ± 3% 0.06050n ± 1% +4.31% (p=0.000 n=20) [Memcpy_5]_L2 0.06400n ± 0% 0.06400n ± 2% 0.00% (p=0.004 n=20) [Memcpy_5]_LLC 0.3320n ± 1% 0.3140n ± 1% -5.42% (p=0.000 n=20) [Memcpy_5]_Cold 0.3620n ± 1% 0.3430n ± 0% -5.25% (p=0.000 n=20) [Memcpy_6]_L1 0.05700n ± 2% 0.05750n ± 3% ~ (p=0.403 n=20) [Memcpy_6]_L2 0.06500n ± 0% 0.06250n ± 1% -3.85% (p=0.000 n=20) [Memcpy_6]_LLC 0.3410n ± 1% 0.3205n ± 1% -6.01% (p=0.000 n=20) [Memcpy_6]_Cold 0.3670n ± 1% 0.3470n ± 1% -5.45% (p=0.000 n=20) [Memcpy_7]_L1 0.05900n ± 2% 0.05900n ± 2% ~ (p=0.296 n=20) [Memcpy_7]_L2 0.06400n ± 2% 0.06400n ± 0% ~ (p=0.327 n=20) [Memcpy_7]_LLC 0.3145n ± 1% 0.2965n ± 1% -5.72% (p=0.000 n=20) [Memcpy_7]_Cold 0.3410n ± 1% 0.3220n ± 0% -5.57% (p=0.000 n=20) [Memcpy_8]_L1 0.03600n ± 3% 0.03600n ± 3% ~ (p=0.804 n=20) [Memcpy_8]_L2 0.04200n ± 0% 0.04100n ± 2% -2.38% (p=0.000 n=20) [Memcpy_8]_LLC 0.2210n ± 1% 0.2090n ± 1% -5.43% (p=0.000 n=20) [Memcpy_8]_Cold 0.2415n ± 1% 0.2300n ± 1% -4.76% (p=0.000 n=20) geomean 0.1184n 0.1148n -3.03% ```
Author
Parents
Loading