[Onnxifi] Warmup cache of output shapes (#48346)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48346
Onnxifi now accepts output shape info for all possible batch sizes. This is used to avoid doing shape inference inside `OnnxifiOp::extractOutputBatchSizes()`.
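As a rough illustration of the caching idea (a minimal sketch, not the actual Caffe2 code; only the member name `output_reshape_info_` appears in this diff, all other names and types are assumptions):

```cpp
// Sketch: output shapes are stored per batch size, so the hot path is a hash
// lookup instead of a shape-inference pass. Hypothetical names throughout,
// except `output_reshape_info_`, which is mentioned in this diff.
#include <cstdint>
#include <unordered_map>
#include <vector>

struct OutputReshapeInfo {
  // One pre-computed shape per network output for a given batch size.
  std::vector<std::vector<int64_t>> output_shapes;
};

class OnnxifiOpSketch {
 public:
  const OutputReshapeInfo& getOutputReshapeInfo(int64_t batch_size) {
    auto it = output_reshape_info_.find(batch_size);
    if (it == output_reshape_info_.end()) {
      // Cache miss: compute once and memoize (the slower 20-40us path
      // mentioned in the test plan below).
      it = output_reshape_info_
               .emplace(batch_size, inferShapesFor(batch_size))
               .first;
    }
    // Cache hit: the 1-4us path; no shape inference runs here.
    return it->second;
  }

 private:
  // Placeholder for the expensive shape-inference fallback.
  OutputReshapeInfo inferShapesFor(int64_t batch_size) {
    return OutputReshapeInfo{{{batch_size, 128}}};  // e.g. one [bs, 128] output
  }

  std::unordered_map<int64_t, OutputReshapeInfo> output_reshape_info_;
};
```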
FB:
In this diff we try to pre-calculate output shapes for all possible batch sizes inside `PredictorContainer` where we supposedly have enough data to do so. This data is then passed down to `OnnxifiOp`, as in the sketch below.
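A sketch of the warmup side, under the same assumptions as the sketch above (`warmupOutputShapes` and `max_batch_size` are hypothetical names, not the actual FB code):

```cpp
// Sketch: a container that knows the maximum batch size can pre-populate the
// cache for every possible batch size up front, so no serving-time request
// pays the shape-inference cost.
void warmupOutputShapes(OnnxifiOpSketch& op, int64_t max_batch_size) {
  for (int64_t bs = 1; bs <= max_batch_size; ++bs) {
    op.getOutputReshapeInfo(bs);  // fills output_reshape_info_ for this bs
  }
}
```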
Here is the dependency graph that I built manually while trying to understand the entire flow:
https://pxl.cl/1rQRv
Test Plan:
Strobelight data https://fburl.com/strobelight/jlhhgt21 shows that `OnnxifiOp::RunOnDevice()` now takes only 2.17% of CPU, down from ~20% with the previous implementation.
Also, each call in the previous implementation took dozens of milliseconds, according to ipiszy:
> After adding more logs, I found each shapeinference call actually takes 40~50ms.
I also temporarily added time measurements for `OnnxifiOp::extractOutputBatchSizes()`. The new implementation typically takes 1 to 4 microseconds; when data for the current batch size is not yet present in `output_reshape_info_`, it takes 20-40 microseconds, which is still much better than the previous implementation.
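For reference, the measurement can be as simple as a `std::chrono` wrapper like the following sketch (the temporary instrumentation itself is not shown here; `measureMicros` is a hypothetical helper):

```cpp
// Sketch: time an arbitrary callable in microseconds with a steady clock.
#include <chrono>

template <typename Fn>
long long measureMicros(Fn&& fn) {
  auto start = std::chrono::steady_clock::now();
  fn();
  return std::chrono::duration_cast<std::chrono::microseconds>(
             std::chrono::steady_clock::now() - start)
      .count();
}
```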
AF canary https://www.internalfb.com/intern/ads/canary/431357944274985799
AI canary https://www.internalfb.com/intern/ads/canary/431365503038313840
Verified using the test tier https://pxl.cl/1sZ4S
Reviewed By: yinghai, ipiszy
Differential Revision: D25047110
fbshipit-source-id: 872dc1578a1e8e7c3ade5f5e2711e77ba290a671