Fix DeepCompile all-gather scheduler candidate selection (#8033)
This PR fixes candidate selection in DeepCompile's all-gather scheduler
heuristic:
- Fix a candidate-selection bug in `fast_free_schedule()`: the scheduler
computed the zero-`free_acc_mem` candidate subset, but then sorted the
full runnable set instead of that subset.
- Keep the existing local scheduling heuristic, but rank candidates with
graph-local all-gather pressure metrics before release-side cost when a
low-live release path is available.
- Add deterministic CPU-only FX scheduler regressions for the zero-free
filter, pressure ordering, fallback candidate ordering, and
single-all-gather ordering.
## Rationale
`fast_free_schedule()` is a local heuristic for reducing
gathered-parameter live ranges. This patch keeps that model, but fixes a
general selection inconsistency: when at least one runnable candidate
can reach release without additional all-gathers, the scheduler should
choose from that zero-`free_acc_mem` subset. The previous code used the
subset only as a branch condition, then ranked all runnable candidates
by `free_cost`, so it could select a candidate that still required
additional all-gathers before release.
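
The inconsistency can be illustrated with a minimal, self-contained
sketch. It is not the DeepSpeed implementation: the `Candidate` class and
the `select_previous` / `select_fixed` helpers are hypothetical, and only
the `free_acc_mem` and `free_cost` names are taken from the description
above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    # Hypothetical stand-in for a runnable node; not a DeepSpeed type.
    name: str
    free_acc_mem: int  # all-gather memory still needed before release
    free_cost: int     # release-side cost


def select_previous(runnable):
    """Pattern described above: subset used only as a branch condition."""
    zero_free = [c for c in runnable if c.free_acc_mem == 0]
    if zero_free:
        # Ranks the full runnable set, so a non-zero candidate can win.
        return min(runnable, key=lambda c: c.free_cost)
    return min(runnable, key=lambda c: c.free_acc_mem)


def select_fixed(runnable):
    """Same branch, but the ranking is restricted to the subset."""
    zero_free = [c for c in runnable if c.free_acc_mem == 0]
    if zero_free:
        return min(zero_free, key=lambda c: c.free_cost)
    return min(runnable, key=lambda c: c.free_acc_mem)


runnable = [
    Candidate("needs_more_ag", free_acc_mem=4096, free_cost=1),
    Candidate("release_ready", free_acc_mem=0, free_cost=5),
]

previous_choice = select_previous(runnable)
fixed_choice = select_fixed(runnable)
```

With this input the previous pattern picks `needs_more_ag` because it has
the lowest `free_cost`, even though `release_ready` can reach release
without additional all-gathers; the fixed pattern picks `release_ready`.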
After preserving the zero-`free_acc_mem` filter, the ordering uses only
workload-independent graph pressure signals already available to the
scheduler: scheduled all-gather count, all-gather byte pressure,
release-side cost, and a stable node-name tie-breaker. In the fallback
path, where every candidate still requires additional all-gathers,
`free_acc_mem` remains the primary selector. The scheduler also keeps
the previous boundary of scheduling only through `schedule_until_ag`,
which avoids making a memory-budget decision without tracking
already-live gathered parameters.
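
One way to read the ordering above is as a lexicographic sort key. The
sketch below is illustrative only: the `ag_count` and `ag_bytes` field
names, the assumption that smaller values rank first, and the secondary
terms of the fallback key are all assumptions, not the actual scheduler
code.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    # Hypothetical record; field names other than free_acc_mem and
    # free_cost are invented for illustration.
    name: str
    free_acc_mem: int  # all-gather memory still needed before release
    ag_count: int      # scheduled all-gather count
    ag_bytes: int      # all-gather byte pressure
    free_cost: int     # release-side cost


def zero_free_key(c):
    # Pressure metrics first, then release-side cost, then node name
    # so that ties resolve deterministically.
    return (c.ag_count, c.ag_bytes, c.free_cost, c.name)


def fallback_key(c):
    # free_acc_mem stays the primary selector; the name tie-breaker
    # here is an assumption.
    return (c.free_acc_mem, c.name)


def pick(runnable):
    zero_free = [c for c in runnable if c.free_acc_mem == 0]
    if zero_free:
        return min(zero_free, key=zero_free_key)
    return min(runnable, key=fallback_key)


zero_free_case = [
    Candidate("z", 0, ag_count=1, ag_bytes=500, free_cost=9),
    Candidate("x", 0, ag_count=2, ag_bytes=100, free_cost=1),
    Candidate("y", 0, ag_count=1, ag_bytes=500, free_cost=9),
]
fallback_case = [
    Candidate("p", 200, ag_count=0, ag_bytes=0, free_cost=0),
    Candidate("q", 100, ag_count=5, ag_bytes=5, free_cost=5),
]
```

In `zero_free_case`, `x` has the lowest `free_cost` but loses on
all-gather count, and the `y`/`z` tie is resolved by name, so `pick`
returns `y` regardless of input order. In `fallback_case`, `q` wins on
`free_acc_mem` alone.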
## Testing
- `python -m pytest tests/unit/compile/test_list_schedule.py -q`
- `pre-commit run --all-files`
---------
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>