Make CUDA graph benchmarking overridable on a per-op basis
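Summary: Some operators need to perform GPU-CPU synchronization, which is not supported during CUDA graph capture. This change lets individual operators override whether they are benchmarked with CUDA graphs.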
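
A minimal sketch of how a per-op override could be resolved. This is not the code in this diff: `BenchmarkOperator`, `use_cuda_graphs`, and `should_use_cuda_graphs` are hypothetical names chosen for illustration, and the sketch assumes a per-op setting simply takes precedence over a harness-wide default.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BenchmarkOperator:
    """Hypothetical operator descriptor; names are illustrative only."""

    name: str
    # None means "follow the harness-wide default";
    # True or False is an explicit per-op override.
    use_cuda_graphs: Optional[bool] = None


def should_use_cuda_graphs(op: BenchmarkOperator, harness_default: bool) -> bool:
    """Resolve the effective setting: a per-op override wins over the default."""
    if op.use_cuda_graphs is not None:
        return op.use_cuda_graphs
    return harness_default


# An op that needs a GPU-CPU sync opts out; an ordinary op follows the default.
syncing_op = BenchmarkOperator(name="op_with_host_sync", use_cuda_graphs=False)
plain_op = BenchmarkOperator(name="plain_op")
```

With this shape, an operator that syncs with the host is benchmarked without graph capture even when the harness default enables it, while all other operators are unaffected.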
Reviewed By: davidberard98
Differential Revision: D58680076
fbshipit-source-id: 7c86c484990445512723ebdda25ef4af8cfffde5