Implement parallel marking (#48600)
Marking uses a work-stealing queue after Chase and Lev, in the variant
optimized for weak memory models by Le et al.
The default number of GC threads is half the number of compute threads.
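For readers unfamiliar with the queue, here is a minimal sketch of a Chase-Lev deque with C11 atomics, using the memory orderings from the Le et al. formulation. It is not the code added by this change: the names (ws_deque_t, ws_push, ws_pop, ws_steal) and the fixed WS_CAPACITY are hypothetical, the buffer never grows, and "empty" and "lost a race" are both reported as NULL.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WS_CAPACITY 1024 /* hypothetical fixed size; must be a power of two */

typedef struct {
    _Atomic int64_t top;    /* index thieves steal from */
    _Atomic int64_t bottom; /* index the owner pushes to */
    _Atomic(void *) buffer[WS_CAPACITY];
} ws_deque_t;

static void ws_init(ws_deque_t *q) {
    atomic_init(&q->top, 0);
    atomic_init(&q->bottom, 0);
    for (size_t i = 0; i < WS_CAPACITY; i++)
        atomic_init(&q->buffer[i], NULL);
}

/* Owner only. Returns false when the deque is full (this sketch does not resize). */
static bool ws_push(ws_deque_t *q, void *obj) {
    int64_t b = atomic_load_explicit(&q->bottom, memory_order_relaxed);
    int64_t t = atomic_load_explicit(&q->top, memory_order_acquire);
    if (b - t >= WS_CAPACITY)
        return false;
    atomic_store_explicit(&q->buffer[b & (WS_CAPACITY - 1)], obj,
                          memory_order_relaxed);
    /* Publish the slot before making it visible through bottom. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&q->bottom, b + 1, memory_order_relaxed);
    return true;
}

/* Owner only. Takes the most recently pushed item, or NULL if none is left. */
static void *ws_pop(ws_deque_t *q) {
    int64_t b = atomic_load_explicit(&q->bottom, memory_order_relaxed) - 1;
    atomic_store_explicit(&q->bottom, b, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    int64_t t = atomic_load_explicit(&q->top, memory_order_relaxed);
    void *obj = NULL;
    if (t <= b) {
        obj = atomic_load_explicit(&q->buffer[b & (WS_CAPACITY - 1)],
                                   memory_order_relaxed);
        if (t == b) {
            /* Last item: race against thieves for it. */
            if (!atomic_compare_exchange_strong_explicit(
                    &q->top, &t, t + 1,
                    memory_order_seq_cst, memory_order_relaxed))
                obj = NULL;
            atomic_store_explicit(&q->bottom, b + 1, memory_order_relaxed);
        }
    } else {
        /* Deque was empty: restore bottom. */
        atomic_store_explicit(&q->bottom, b + 1, memory_order_relaxed);
    }
    return obj;
}

/* Any thread. Takes the oldest item, or NULL if empty or if the race was lost. */
static void *ws_steal(ws_deque_t *q) {
    int64_t t = atomic_load_explicit(&q->top, memory_order_acquire);
    atomic_thread_fence(memory_order_seq_cst);
    int64_t b = atomic_load_explicit(&q->bottom, memory_order_acquire);
    void *obj = NULL;
    if (t < b) {
        obj = atomic_load_explicit(&q->buffer[t & (WS_CAPACITY - 1)],
                                   memory_order_relaxed);
        if (!atomic_compare_exchange_strong_explicit(
                &q->top, &t, t + 1,
                memory_order_seq_cst, memory_order_relaxed))
            return NULL;
    }
    return obj;
}
```

In this sketch each marking thread would own one deque, push and pop newly discovered objects at the bottom, and steal from the top of another thread's deque when its own runs dry.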
Co-authored-by: Gabriel Baraldi <baraldigabriel@gmail.com>
Co-authored-by: Valentin Churavy <v.churavy@gmail.com>