[SROA] Avoid redundant `.oldload` generation when `memset` fully covers a partition (#179643)
In our internal (ByteDance) builds we frequently hit very large
`DeadPhiWeb`s that cause serious compile-time slowdowns, especially in
some auto-generated code where a single file can take 20+ minutes to
compile. There were previous attempts to reduce `DeadPhiWeb` in
`InstCombine` (e.g. llvm/llvm-project#108876 and
llvm/llvm-project#158057), but in our workload we still see a lot of
time spent later in the pipeline (notably `JumpThreading` and
`CorrelatedValuePropagation`).
After digging into our cases, a big chunk of the `DeadPhiWeb` comes from
SROA rewriting `memset`s. We often end up with patterns like:
```
%.sroa.xxx.oldload = load <ty>, ptr %.sroa.xxx
%unused = ptrtoint ptr %.sroa.xxx.oldload to i64 ; or a bitcast-like use
store <ty> <new_value>, ptr %.sroa.xxx
```
Even if `%unused` is cleaned up by later DCE-style passes, the
load/store shape can still make `PromoteMem2Reg` conservatively treat
many blocks as live-in when computing the iterated dominance frontier
(IDF). With cyclic CFGs this can easily create large, sticky dead phi
webs, and the rest of the pipeline pays for it.
The core issue is that `visitMemSetInst` was using the slice’s original
offsets (`BeginOffset`/`EndOffset`) when deciding whether it needs to
merge with an `.oldload` to preserve bytes not written by the `memset`.
First, there was a typo in the original condition (`EndOffset !=
NewAllocaBeginOffset` instead of `EndOffset != NewAllocaEndOffset`),
which effectively made the check always true and forced the merge path
in most cases. Second, even if the typo is fixed, comparing the original
slice range against the partition bounds is still too strict: cases
where the `memset` contains the partition (e.g. a large `memset` over
the whole alloca while the partition is just a subrange) would still be
misclassified as requiring an `.oldload`. Both issues lead to many
redundant loads and downstream dead phi webs.
This change switches the check to use the already-computed intersection
offsets (`NewBeginOffset`/`NewEndOffset`) against the partition bounds,
so we generate `.oldload` only when the `memset` writes just part of
the partition:
```diff
-  if (IntTy && (BeginOffset != NewAllocaBeginOffset ||
-                EndOffset != NewAllocaBeginOffset)) {
+  if (IntTy && (NewBeginOffset != NewAllocaBeginOffset ||
+                NewEndOffset != NewAllocaEndOffset)) {
     // emit oldload + insertInteger merge
   }
```
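To make the difference between the three variants of the condition concrete, here is a minimal standalone sketch that models only the offset comparison. The `Offsets` struct and the `needsOldLoad*` helpers are hypothetical names used for illustration and are not part of SROA; the `IntTy` gate is omitted, and the intersection is assumed to be the usual clamp of the slice range to the partition bounds.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the offsets seen while rewriting a memset slice.
// Ranges are half-open: [Begin, End).
struct Offsets {
  uint64_t BeginOffset, EndOffset;                   // original memset slice
  uint64_t NewAllocaBeginOffset, NewAllocaEndOffset; // partition bounds
};

// Old condition, including the typo: EndOffset is compared against the
// partition *begin*, so any non-empty slice takes the merge path.
inline bool needsOldLoadBefore(const Offsets &O) {
  return O.BeginOffset != O.NewAllocaBeginOffset ||
         O.EndOffset != O.NewAllocaBeginOffset;
}

// Typo fixed, but still comparing the original slice range: a memset that
// strictly contains the partition is still flagged as needing a merge.
inline bool needsOldLoadTypoFixed(const Offsets &O) {
  return O.BeginOffset != O.NewAllocaBeginOffset ||
         O.EndOffset != O.NewAllocaEndOffset;
}

// New condition: compare the slice/partition intersection (assumed here to be
// a simple clamp) against the partition bounds.
inline bool needsOldLoadAfter(const Offsets &O) {
  const uint64_t NewBeginOffset =
      std::max(O.BeginOffset, O.NewAllocaBeginOffset);
  const uint64_t NewEndOffset = std::min(O.EndOffset, O.NewAllocaEndOffset);
  return NewBeginOffset != O.NewAllocaBeginOffset ||
         NewEndOffset != O.NewAllocaEndOffset;
}

// Walks the three interesting shapes for a partition covering bytes [8, 16).
inline void checkExamples() {
  const Offsets Exact = {8, 16, 8, 16};     // memset == partition
  const Offsets Containing = {0, 32, 8, 16}; // memset covers whole alloca
  const Offsets Partial = {8, 12, 8, 16};    // memset writes half the partition

  assert(needsOldLoadBefore(Exact));         // typo forces a merge
  assert(!needsOldLoadTypoFixed(Exact));
  assert(!needsOldLoadAfter(Exact));

  assert(needsOldLoadTypoFixed(Containing)); // still too strict
  assert(!needsOldLoadAfter(Containing));    // full overwrite, no .oldload

  assert(needsOldLoadAfter(Partial));        // partial overwrite still merges
}
```

Only the last variant separates full overwrites from partial ones in all three shapes, which is the behaviour this patch is after.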
In our workload this cuts down a lot of pointless `.oldload`s and helps
reduce the size of dead phi webs seen after `mem2reg`, improving compile
time without changing semantics (partial overwrites still merge, full
overwrites don’t).