Fix remaining CPU operators for non-standard bools (#79390)
In addition to using `c10::load` in more kernels, this specializes
`Vectorized<bool>::loadu` to first load as `int8_t` then convert to
bool, in the same way as `c10::load` does.
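
A minimal sketch of the load-then-convert idea described above, not the actual `c10::load` or `Vectorized<bool>::loadu` code. The helper names `load_bool_sketch` and `load_bools_sketch` are hypothetical, and the sketch assumes `bool` occupies one byte:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper, not the PyTorch implementation: read one bool-sized
// element as a raw int8_t and normalize any non-zero byte to true.
inline bool load_bool_sketch(const void* src) {
  static_assert(sizeof(bool) == sizeof(std::int8_t),
                "sketch assumes a 1-byte bool");
  std::int8_t raw;
  // memcpy into an integer so the memory is never read as a bool object,
  // which would be undefined behavior for bytes other than 0 or 1.
  std::memcpy(&raw, src, sizeof(raw));
  return raw != 0;
}

// Hypothetical helper: apply the same conversion to n contiguous elements.
inline std::vector<bool> load_bools_sketch(const void* src, std::size_t n) {
  const auto* bytes = static_cast<const unsigned char*>(src);
  std::vector<bool> out(n);
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = load_bool_sketch(bytes + i);
  }
  return out;
}
```

With this approach a byte such as `2` or `255` is treated as `true` instead of producing an invalid bool value.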
For `unique_cpu`, the values are loaded inside the
`std::unordered_set` constructor, so I've added a separate function
specifically for bools. This fixes the loading issue and is faster
because it exploits the fact that a bool can only be true or false.
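
As an illustration of why a bool-specific path can be faster, here is a rough sketch, not the function added in this PR. The name `unique_bools_sketch` is hypothetical, and the input is assumed to be raw bytes where any non-zero value means true:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: compute the sorted unique values of a bool buffer
// without a hash set. The buffer is read as raw bytes so that non-standard
// bools (bytes other than 0 or 1) are still handled safely.
inline std::vector<bool> unique_bools_sketch(const std::uint8_t* data,
                                             std::size_t n) {
  bool seen_false = false;
  bool seen_true = false;
  // Only two outcomes are possible, so stop as soon as both were seen.
  for (std::size_t i = 0; i < n && !(seen_false && seen_true); ++i) {
    if (data[i] != 0) {
      seen_true = true;
    } else {
      seen_false = true;
    }
  }
  std::vector<bool> out;
  if (seen_false) out.push_back(false);
  if (seen_true) out.push_back(true);
  return out;
}
```

The scan needs two flags instead of a hash table and can exit early, which is where the speedup in a sketch like this would come from.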
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79390
Approved by: https://github.com/mruberry