[pytorch] fewer cuda sync in unique by using cub instead of thrust (#57323)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57323
Use the cub library instead of thrust in unique to reduce the number of CUDA stream synchronizations.
Reviewed By: ngimel
Differential Revision: D28088029
fbshipit-source-id: b616294cd776aa5643c153e172258a0153a42b6a