Add CUDA sync when ops run on GPU (#29936)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29936
This diff adds a CUDA synchronization after op execution when ops run on GPU, to ensure that all work queued on the CUDA streams has completed.
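A minimal sketch of the idea, not the code in this diff: CUDA kernels launch asynchronously, so a timing loop that skips the sync may measure little more than launch overhead. The helper name `timed_forward` and its parameters are hypothetical, chosen for illustration only; `torch.cuda.synchronize()` is the existing PyTorch call that blocks until queued CUDA work finishes.

```python
import time

import torch


def timed_forward(op, inputs, device, iterations=1):
    """Return the mean forward time in microseconds (hypothetical helper)."""
    start = time.perf_counter()
    for _ in range(iterations):
        op(*inputs)
        # CUDA kernels are queued asynchronously; wait for all streams to
        # drain so the clock covers the actual GPU work.
        if device == "cuda":
            torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return elapsed * 1e6 / iterations


# Fall back to CPU when no GPU is present so the sketch still runs.
device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.rand(64, 64, 64, device=device)
b = torch.rand(64, 64, 64, device=device)
elapsed_us = timed_forward(torch.add, (a, b), device)
print(f"Forward Execution Time (us) : {elapsed_us:.3f}")
```

On a CPU-only machine the sync branch is skipped and the helper behaves like a plain wall-clock timer.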
Test Plan:
```
buck run mode/opt //caffe2/benchmarks/operator_benchmark:benchmark_all_test -- --iterations 1
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short
# Benchmarking PyTorch: add
# Mode: Eager
# Name: add_M64_N64_K64_cpu
# Input: M: 64, N: 64, K: 64, device: cpu
Forward Execution Time (us) : 154.412
# Benchmarking PyTorch: add
# Mode: Eager
# Name: add_M64_N64_K64_cuda
# Input: M: 64, N: 64, K: 64, device: cuda
Forward Execution Time (us) : 101.115
...
```
Reviewed By: hl475
Differential Revision: D18542732
fbshipit-source-id: b979d26a174f488e971074dc1e16b00e17179c80