Make `chunk_content_internal` parallel (#3836)
This switches the `chunk_content_internal` function from a sequential
BFS to a parallel BFS (+ reverse topological sort at the end).
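
For readers unfamiliar with the approach, here is a minimal sketch of the general shape, not the code in this PR. It assumes an acyclic graph, uses plain `std` scoped threads in place of the real task system, and every name in it (`Graph`, `get_children`, `parallel_bfs`, `reverse_topological`) is a hypothetical stand-in. Unlike the implementation described here, it calls its `get_children` stand-in exactly once per node.

```rust
use std::collections::{HashMap, HashSet};
use std::thread;

/// Hypothetical stand-in for the real graph: node id -> referenced children.
type Graph = HashMap<u32, Vec<u32>>;

/// Hypothetical stand-in for the real `get_children` lookup.
fn get_children(graph: &Graph, node: u32) -> Vec<u32> {
    graph.get(&node).cloned().unwrap_or_default()
}

/// Level-synchronous parallel BFS: every node in the current frontier is
/// expanded on its own scoped thread, then the results are merged serially.
fn parallel_bfs(graph: &Graph, roots: &[u32]) -> HashMap<u32, Vec<u32>> {
    let mut edges: HashMap<u32, Vec<u32>> = HashMap::new();
    let mut seen: HashSet<u32> = HashSet::new();
    let mut frontier: Vec<u32> = Vec::new();
    for &root in roots {
        if seen.insert(root) {
            frontier.push(root);
        }
    }

    while !frontier.is_empty() {
        // Expand the whole frontier in parallel.
        let expanded: Vec<(u32, Vec<u32>)> = thread::scope(|s| {
            let handles: Vec<_> = frontier
                .iter()
                .map(|&node| s.spawn(move || (node, get_children(graph, node))))
                .collect();
            handles
                .into_iter()
                .map(|h| h.join().expect("worker thread panicked"))
                .collect()
        });

        // Merge serially so that `seen` needs no locking.
        let mut next = Vec::new();
        for (node, children) in expanded {
            for &child in &children {
                if seen.insert(child) {
                    next.push(child);
                }
            }
            edges.insert(node, children);
        }
        frontier = next;
    }
    edges
}

/// Iterative post-order DFS over the collected edges. For an acyclic graph
/// this emits every node after all of its children (reverse topological order).
fn reverse_topological(edges: &HashMap<u32, Vec<u32>>, roots: &[u32]) -> Vec<u32> {
    let mut order = Vec::new();
    let mut visited = HashSet::new();
    // Each entry is (node, children_already_scheduled).
    let mut stack: Vec<(u32, bool)> = roots.iter().rev().map(|&r| (r, false)).collect();

    while let Some((node, children_scheduled)) = stack.pop() {
        if children_scheduled {
            order.push(node);
            continue;
        }
        if !visited.insert(node) {
            continue;
        }
        stack.push((node, true));
        if let Some(children) = edges.get(&node) {
            for &child in children.iter().rev() {
                if !visited.contains(&child) {
                    stack.push((child, false));
                }
            }
        }
    }
    order
}
```

The BFS visits nodes in whatever order the frontier happens to produce, so the final sort is what restores a deterministic, dependency-respecting order. Spawning one thread per frontier node is only for illustration; a real implementation would use a task pool.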
I expected this to make some difference in performance, as traversing
references in parallel can lead to better CPU usage (see #3771), but in
practice our benchmarks show no significant difference.
Real apps might be a different story, but I didn't notice any particular
performance improvement on vercel.com either.
This implementation is not perfect (it makes more calls to
`get_children` than strictly necessary), but I think it's good enough
to measure whether there is any performance improvement to be had.
Marking this as a draft for now, as it's more complicated than the
current implementation and there's no clear win from adopting it.