[Static Runtime][DI] Fuse list unpack and variadic_grouped_accessor_op (#66585)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66585
Add a new op, `static_runtime::fused_variadic_grouped_accessor_op`, that returns its tensors as separate outputs rather than as a single tensor list, and incorporate it into `FuseListUnpack`. This eliminates the `ListUnpack` overhead and the associated tensor refcount bumps.
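
For illustration only, the rewrite performed by the fusion can be sketched on a toy graph in Python. This is not the actual C++ pass: the `Node` class, the `fuse_list_unpack` helper, the unfused op name `static_runtime::variadic_grouped_accessor_op` (inferred from the commit title), and the single-use check are all assumptions of this sketch.

```python
from dataclasses import dataclass

@dataclass
class Node:
    """Hypothetical stand-in for a graph node: an op kind plus value names."""
    kind: str
    inputs: list
    outputs: list

# Assumed mapping from the list-producing op to its fused, multi-output form.
FUSABLE = {
    "static_runtime::variadic_grouped_accessor_op":
        "static_runtime::fused_variadic_grouped_accessor_op",
}

def fuse_list_unpack(nodes):
    """Merge `op -> prim::ListUnpack` pairs into one multi-output node."""
    # Count how many times each value is consumed.
    uses = {}
    for n in nodes:
        for v in n.inputs:
            uses[v] = uses.get(v, 0) + 1

    result, skipped = [], set()
    for n in nodes:
        if id(n) in skipped:
            continue
        # Only fuse when the list output feeds exactly one consumer.
        if n.kind in FUSABLE and len(n.outputs) == 1 and uses.get(n.outputs[0], 0) == 1:
            unpack = next(
                (m for m in nodes
                 if m.kind == "prim::ListUnpack" and m.inputs == [n.outputs[0]]),
                None,
            )
            if unpack is not None:
                # The fused node takes over the unpacked outputs directly,
                # so no intermediate list value remains in the graph.
                result.append(Node(FUSABLE[n.kind], list(n.inputs), list(unpack.outputs)))
                skipped.add(id(unpack))
                continue
        result.append(n)
    return result

graph = [
    Node("static_runtime::variadic_grouped_accessor_op", ["a", "b"], ["lst"]),
    Node("prim::ListUnpack", ["lst"], ["x", "y"]),
    Node("aten::add", ["x", "y"], ["z"]),
]
fused = fuse_list_unpack(graph)
for node in fused:
    print(node.kind, node.inputs, "->", node.outputs)
```

In this toy graph the intermediate `lst` value and the `prim::ListUnpack` node disappear, and the downstream `aten::add` consumes the fused op's outputs directly.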
Test Plan:
**Accuracy Test**
Model 294738512_40 (manually confirmed that fusion happens)
```
get 2861 prediction values
get 2861 prediction values
max_error: 0 total: 0
```
Model 296213501_65 (contains the V2 op): accuracy test passes with 0 errors.
**Performance**
TW replayer test at 800 QPS, stacked with D31482816 (https://github.com/pytorch/pytorch/commit/72e25c9f4ed15a94974041d1d8e76125cff8c48d), shows a 5% CPU decrease for the storage tier.
Results:
{F673610679}
Reviewed By: hlu1
Differential Revision: D31620408
fbshipit-source-id: f05c298bcbce61a491b63d575af4aca746881696