[quant][pyper] Support quantization of ops in fork-wait subgraph (#44048)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44048
Inline the fork-wait calls so that the ops to be quantized are visible in the main graph.
Also fix the InlineForkWait JIT pass to handle the case where the aten::wait call is not present in the main graph
and the subgraph returns a future tensor instead.
Example:
```
graph(%self.1 : __torch__.dper3.core.interop.___torch_mangle_6325.DperModuleWrapper,
%argument_1.1 : Tensor,
%argument_2.1 : Tensor):
%3 : Future[Tensor[]] = prim::fork_0(%self.1, %argument_1.1, %argument_2.1) # :0:0
return (%3)
with prim::fork_0 = graph(%self.1 : __torch__.dper3.core.interop.___torch_mangle_5396.DperModuleWrapper,
%argument_1.1 : Tensor,
%argument_2.1 : Tensor):
%3 : __torch__.dper3.core.interop.___torch_mangle_6330.DperModuleWrapper = prim::GetAttr[name="x"](%self.1)
%4 : __torch__.dper3.core.interop.___torch_mangle_5397.DperModuleWrapper = prim::GetAttr[name="y"](%self.1)
%5 : __torch__.dper3.core.interop.___torch_mangle_6327.DperModuleWrapper = prim::GetAttr[name="z"](%4)
%6 : Tensor = prim::CallMethod[name="forward"](%5, %argument_1.1, %argument_2.1) # :0:0
%7 : None = prim::CallMethod[name="forward"](%3, %6) # :0:0
%8 : Tensor[] = prim::ListConstruct(%6)
return (%8)
```
Test Plan:
python test/test_quantization.py test_interface_with_fork
Imported from OSS
Reviewed By: vkuzo
Differential Revision: D23481003
fbshipit-source-id: 2e756be73c248319da38e053f021888b40593032