Use c10::cuda:: primitives rather than make CUDA runtime calls directly (#41405)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41405
Test Plan:
**Imported from GitHub: all checks have passed**
{F244195355}
**The Intern Builds & Tests have 127 success, 5 no signals, and 1 failure. Double check the failed test log file, the failure is result differences:**
- AssertionError: 0.435608434677124 != 0.4356083869934082
- AssertionError: 0.4393022060394287 != 0.4393021583557129
- AssertionError: 0.44707541465759276 != 0.44707536697387695
These are all very small numerical errors (within 0.0000001).
Reviewed By: malfet
Differential Revision: D22531486
Pulled By: threekindoms
fbshipit-source-id: 21543ec76bb9b502885b5146c8ba5ede719be9ff