[caffe2] Add an optimization to avoid extra fp32->fp16 conversions in Onnxifi (#53560)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53560
If an op like Fused8BitRowwiseQuantizedToFloat ends up on CPU while a downstream op like Tile lands on an accelerator that only supports FP16, we want the FP32->FP16 conversion to happen on CPU rather than on the accelerator, so the accelerator does not waste cycles on the cast.
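
A minimal sketch of the idea (not the Onnxifi transform itself): cast FP32 to FP16 on the host with caffe2's FloatToHalf op before the tensor crosses into the accelerator-bound partition, so only FP16 data reaches the accelerator. Blob names such as "emb_fp32" are hypothetical.

```python
import numpy as np
from caffe2.python import core, workspace

# Hypothetical FP32 output of a CPU-resident op
# (e.g. Fused8BitRowwiseQuantizedToFloat).
workspace.FeedBlob("emb_fp32", np.random.rand(4, 8).astype(np.float32))

cpu_net = core.Net("cpu_side")
# FloatToHalf runs on CPU; its FP16 output is what would be fed to the
# accelerator partition (e.g. the Tile op mentioned above), avoiding an
# extra cast on the accelerator.
cpu_net.FloatToHalf(["emb_fp32"], ["emb_fp16"])

workspace.RunNetOnce(cpu_net)
print(workspace.FetchBlob("emb_fp16").dtype)  # float16
```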
Reviewed By: ChunliF
Differential Revision: D26862322
fbshipit-source-id: a7af162f2537ee9e4a78e6ef3f587129de410b07