onnxruntime
1f458385 - [MLAS] Fix rotary avx2 kernel invalid access (#26389)

Commit
96 days ago
[MLAS] Fix rotary avx2 kernel invalid access (#26389) This fixes an issue that _mm256_maskload_ps intrinsic used in remainder-handling logic introduced in https://github.com/microsoft/onnxruntime/pull/23694. The core of the problem is that _mm256_maskload_ps (and its store equivalent) can read beyond the masked elements. Even if mask correctly specifies that you only want to load, for example, 3 floats, the intrinsic may still read the full 32 bytes (8 floats) from the provided memory address. The invalid access occurs when one of buffers (input, sin_data, or cos_data) ends near the boundary of a memory page, and the part of the 32-byte read that you don't care about (i.e., the masked-off part) falls onto an unmapped page. This will cause a segmentation fault (invalid access). The Solution: Use a Scalar Remainder Loop The simplest, safest, and most robust solution is to replace the masked AVX remainder logic with a simple scalar loop. This is the exact strategy already used by your RopeKernel_Avx2_fp16_Impl functions, which are safe from this bug. The performance impact of this change will be negligible, as this loop only processes the final 1-15 elements. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Author
Parents
Loading