llvm-project
4bc9d29f - [SROA] Use tree-structure merge to remove alloca (#152793)

Commit
3 days ago
[SROA] Use tree-structure merge to remove alloca (#152793) This patch introduces a new optimization in SROA that handles the pattern where multiple non-overlapping vector `store`s completely fill an `alloca`. The current approach to handle this pattern introduces many `.vecexpand` and `.vecblend` instructions, which can dramatically slow down compilation when dealing with large `alloca`s built from many small vector `store`s. For example, consider an `alloca` of type `<128 x float>` filled by 64 `store`s of `<2 x float>` each. The current implementation requires: - 64 `shufflevector`s( `.vecexpand`) - 64 `select`s ( `.vecblend` ) - All operations use masks of size 128 - These operations form a long dependency chain This kind of IR is both difficult to optimize and slow to compile, particularly impacting the `InstCombine` pass. This patch introduces a tree-structured merge approach that significantly reduces the number of operations and improves compilation performance. Key features: - Detects when vector `store`s completely fill an `alloca` without gaps - Ensures no loads occur in the middle of the store sequence - Uses a tree-based approach with `shufflevector`s to merge stored values - Reduces the number of intermediate operations compared to linear merging - Eliminates the long dependency chains that hurt optimization Example transformation: ``` // Before: (stores do not have to be in order) %alloca = alloca <8 x float> store <2 x float> %val0, ptr %alloca ; offset 0-1 store <2 x float> %val2, ptr %alloca+16 ; offset 4-5 store <2 x float> %val1, ptr %alloca+8 ; offset 2-3 store <2 x float> %val3, ptr %alloca+24 ; offset 6-7 %result = load <8 x float>, ptr %alloca // After (tree-structured merge): %shuffle0 = shufflevector %val0, %val1, <4 x i32> <i32 0, i32 1, i32 2, i32 3> %shuffle1 = shufflevector %val2, %val3, <4 x i32> <i32 0, i32 1, i32 2, i32 3> %result = shufflevector %shuffle0, %shuffle1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7> ``` Benefits: - Logarithmic depth (O(log n)) instead of linear dependency chains - Fewer total operations for large vectors - Better optimization opportunities for subsequent passes - Significant compilation time improvements for large vector patterns For some large cases, the compile time can be reduced from about 60s to less than 3s. --------- Co-authored-by: chengjunp <chengjunp@nividia.com>
Author
Parents
Loading