Accept non-standard bools in more CUDA kernels

This fixes all remaining CUDA kernels, except those using `cub` or
`thrust`, to accept boolean tensors with values other than 1 or 0.
I do this by using `c10::load` in more places and by adding a
`load_vector` helper to `MemoryAccess.cuh` that does the same thing
for vectorized loads.
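
For reviewers unfamiliar with the problem, here is a minimal host-side
sketch of the idea. It is an illustration only, not the code in this PR:
the `sketch` namespace, the `Vec` struct, and both function bodies are
hypothetical stand-ins written in plain C++, and the sketch assumes a
one-byte `bool`. The point is that a `bool` is loaded by reading its raw
byte and comparing it to zero, so any non-zero byte counts as true.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

namespace sketch {  // hypothetical namespace, not part of PyTorch

// Generic case: copy the object representation out of memory.
template <typename T>
T load(const void* src) {
  T value;
  std::memcpy(&value, src, sizeof(T));
  return value;
}

// bool case: read the raw byte and convert it explicitly, so a stored
// byte such as 2 or 255 is treated as true instead of being
// reinterpreted as an invalid bool.
template <>
inline bool load<bool>(const void* src) {
  static_assert(sizeof(bool) == 1, "sketch assumes a one-byte bool");
  unsigned char byte;
  std::memcpy(&byte, src, 1);
  return byte != 0;
}

// Hypothetical fixed-size vector type standing in for a vectorized load unit.
template <typename T, int N>
struct Vec {
  T val[N];
};

// Load N consecutive elements, normalizing each one through load<T>.
template <typename T, int N>
Vec<T, N> load_vector(const void* src) {
  assert(src != nullptr);
  const unsigned char* bytes = static_cast<const unsigned char*>(src);
  Vec<T, N> out;
  for (int i = 0; i < N; ++i) {
    out.val[i] = load<T>(bytes + static_cast<std::size_t>(i) * sizeof(T));
  }
  return out;
}

}  // namespace sketch
```

In this sketch, calling `sketch::load_vector<bool, 4>` on the bytes
`{0, 1, 2, 255}` yields `false, true, true, true`, while non-bool types
go through the generic path unchanged.
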
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78957
Approved by: https://github.com/mruberry