[libc] Rework slab cache data structure for GPU allocator
Summary:
This was previously a Trieber stack, which is a perfectly fine generic
and lock-free data structure. However, this used some expensive CAS
operations and had issues with ABA. Because the only user of this was
the slab cache mechanism, we can pretty safely specialize it. Instead,
we simply search a fixed size buffer for some sentinal values and CAS
into it.
For allocations that only ever hit the cache, this improves performance
from ~9000 cycles to ~6000 cycles and similar improvements for workloads
that feel the pain of small thread counts hitting the cache.