Port masked_select CUDA from TH to ATen (#35429)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33054
This PR does not directly depend on PR https://github.com/pytorch/pytorch/issues/33269 (the CPU counterpart), but whichever of the two PRs is merged last should remove `_th_masked_select` and `_th_masked_select_bool` from `aten/src/ATen/Declarations.cwrap`.
Performance stats are here: https://github.com/pytorch/pytorch/issues/33054#issuecomment-591710014
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35429
Differential Revision: D20958928
Pulled By: ngimel
fbshipit-source-id: 4704f5d2d271f3669cecd4f41d266ec1f67ec7f2