lighteval
Use `n=16` samples to estimate `pass@1` for AIME benchmarks
#661
Merged
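
For context, the sketch below shows how `pass@1` can be estimated from `n=16` samples per problem. It uses the standard unbiased `pass@k` estimator, which for `k=1` reduces to the fraction of correct samples. This is not the lighteval implementation: the function names `pass_at_k` and `benchmark_pass_at_1` and the example counts are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int = 1) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Hypothetical helper for illustration; not taken from lighteval.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    # 1 - P(all k samples drawn without replacement are incorrect).
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(correct_counts: list[int], n: int = 16) -> float:
    """Average per-problem pass@1 over a benchmark (assumed aggregation)."""
    return sum(pass_at_k(n, c, 1) for c in correct_counts) / len(correct_counts)


# Example: three problems with 16, 8, and 0 correct samples out of 16.
score = benchmark_pass_at_1([16, 8, 0], n=16)
```

Averaging over 16 samples per problem reduces the variance of the estimate compared with a single sample, which matters more when a benchmark has few problems.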
