Implement parallel sweeping of stack pools (#55643)
Also use a round robin to only return stacks one thread at a time to
avoid contention on munmap syscalls.
Using
https://github.com/gbaraldi/cilkbench_julia/blob/main/cilk5julia/nqueens.jl
as a benchmark, wall time is about 12% faster. This benchmark has other
odd behaviours, especially when single threaded: it calls `wait`
thousands of times per second, and when single threaded every one of
those calls does a `jl_process_events` call, which is a syscall plus
preemption, so it looks like a hang. With multiple threads the issue
doesn't appear.
The idea behind the round robin is twofold. First, we are freeing too
much, and after talking with vtjnash we may want less aggressive
behaviour. Second, munmap takes a lock on most OSes, so doing
it in parallel has severe negative scaling.
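
A minimal sketch of the round-robin idea, not the actual runtime code: every
thread's pool is visited on each sweep, but only the thread whose index
matches the sweep counter returns its cached stacks. The names
`stack_pool_t`, `pick_releaser`, `sweep_pool` and `sweep_all` are
hypothetical, the pool is reduced to a counter, and the loop is sequential
here, whereas the real sweep runs in parallel and would call munmap.

```c
#include <stddef.h>

/* Hypothetical per-thread pool: only tracks how many free stacks are cached. */
typedef struct {
    size_t nfree;
} stack_pool_t;

/* Pick the single thread allowed to return stacks during this sweep. */
static size_t pick_releaser(size_t sweep_no, size_t nthreads)
{
    return sweep_no % nthreads;
}

/* Sweep one pool. Only the chosen thread releases its cached stacks;
 * all other threads keep theirs. Returns the number of stacks released. */
static size_t sweep_pool(stack_pool_t *pool, size_t tid, size_t releaser)
{
    if (tid != releaser)
        return 0;
    size_t released = pool->nfree; /* assumption: real code would munmap each stack here */
    pool->nfree = 0;
    return released;
}

/* Visit every pool for one sweep (sequential stand-in for the parallel sweep). */
static size_t sweep_all(stack_pool_t *pools, size_t nthreads, size_t sweep_no)
{
    size_t releaser = pick_releaser(sweep_no, nthreads);
    size_t total = 0;
    for (size_t tid = 0; tid < nthreads; tid++)
        total += sweep_pool(&pools[tid], tid, releaser);
    return total;
}
```

With this shape, at most one thread is inside munmap per sweep, and each
pool is drained only once every `nthreads` sweeps, which covers both
motivations above.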