Support non-standard bools in CUDA unique (#79392)
For the thrust version of unique, this just uses `c10::load`. However,
the cub implementation is a bit more complicated because the value
pointers are dereferenced inside of the `cub` library.
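
To illustrate what normalizing a non-standard bool means, here is a minimal host-side C++ sketch. `load_bool` is a hypothetical helper written for this example, not the actual `c10::load` implementation; it assumes the idea is to read the underlying byte as `uint8_t` and compare it against zero, so that bytes such as `0x02` or `0xFF` still read as `true`.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper (illustration only): read one byte of bool storage
// and normalize it, so any non-zero byte becomes true.
inline bool load_bool(const void* src) {
  assert(src != nullptr);
  std::uint8_t raw;
  // Copy the byte out instead of reading it through a bool lvalue, which
  // is not well defined when the stored byte is neither 0 nor 1.
  std::memcpy(&raw, src, sizeof(raw));
  return raw != 0;
}
```
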
For the consecutive values path, I use `cub::TransformInputIterator`
to read the values as a `uint8_t` then cast to `bool`. But for the
path requiring a sort, this doesn't work because `cub`'s radix sorts
don't support iterators. Instead, I've written a special case for
`bool` which avoids the sorting step entirely.
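
The reason a sort can be skipped for `bool` is that the sorted unique output has at most two entries, `false` then `true`, so counting the non-zero bytes is enough. The following host-side C++ sketch shows that idea only; `BoolUnique` and `unique_bool` are hypothetical names for this example and are not the CUDA code added by this change.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type: sorted unique values and their counts.
struct BoolUnique {
  std::vector<bool> values;
  std::vector<std::int64_t> counts;
};

// Illustration only: unique over bool storage without sorting.
// Input bytes are treated as uint8_t, so non-standard values such as
// 2 or 255 are counted as true.
BoolUnique unique_bool(const std::uint8_t* data, std::size_t n) {
  assert(data != nullptr || n == 0);
  std::int64_t num_true = 0;
  for (std::size_t i = 0; i < n; ++i) {
    num_true += (data[i] != 0) ? 1 : 0;
  }
  const std::int64_t num_false = static_cast<std::int64_t>(n) - num_true;

  BoolUnique out;
  // Emit false before true so the output is already in sorted order.
  if (num_false > 0) {
    out.values.push_back(false);
    out.counts.push_back(num_false);
  }
  if (num_true > 0) {
    out.values.push_back(true);
    out.counts.push_back(num_true);
  }
  return out;
}
```

On a GPU the counting loop would presumably be a reduction rather than a serial loop, but the output logic stays the same.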
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79392
Approved by: https://github.com/ngimel