Add caching of nvFuser fusion creation for torch._prims.executor (#80525)
In the current setup, a `Fusion` object is constructed from the `GraphModule` and args on every call to the `execute` function, which is expensive.
This PR uses `functools.lru_cache` so the `Fusion` creation cost is paid once per `GraphModule` and set of args. Currently the cache key treats the shape, strides, and dtype of the tensors as static; this can be relaxed later (keying only on ndim, contiguity, and dtype) to make better use of nvFuser's internal caching mechanism.
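A minimal, standalone sketch of the caching idea follows. It is an assumption about the flow rather than the exact implementation: `build_fusion` here is a hypothetical stand-in for the expensive nvFuser `Fusion` construction so the example runs on its own, and the cache key mirrors the static shape/stride/dtype scheme described above.
```py
from functools import lru_cache

import torch


def tensor_spec(t):
    # Hashable summary of a tensor: shape, strides, and dtype form the cache key,
    # matching the "static shape/stride/dtype" caching described above.
    return tuple(t.shape), tuple(t.stride()), t.dtype


@lru_cache(maxsize=None)
def build_fusion(gm, arg_specs):
    # Expensive step: in the real executor this would lower `gm` to an nvFuser
    # Fusion; the stand-in just returns a callable, built once per key.
    print("building fusion for", arg_specs)  # runs once per (gm, specs) pair
    return lambda *args: gm(*args)


def execute(gm, *args):
    specs = tuple(tensor_spec(a) for a in args if isinstance(a, torch.Tensor))
    fused = build_fusion(gm, specs)
    return fused(*args)
```
Repeated calls with tensors of the same shape, strides, and dtype hit the cache, so the construction cost is paid only on the first call, which is what produces the microsecond-level timing below.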
On master:
```py
In [2]: a = torch.randn(3, 3, device='cuda')
In [3]: with TorchRefsMode.push():
...: gm = make_fx(lambda x: torch.sigmoid(x))(a)
...:
In [4]: %%timeit
...: execute(gm, a, executor="nvfuser")
...: torch.cuda.synchronize()
175 ms ± 1.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
This PR:
```py
In [2]: a = torch.randn(3, 3, device='cuda')
In [3]: with TorchRefsMode.push():
...: gm = make_fx(lambda x: torch.sigmoid(x))(a)
...:
In [4]: %%timeit
...: execute(gm, a, executor="nvfuser")
...: torch.cuda.synchronize()
62.6 µs ± 9.99 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
```
In addition, this PR adds support for pytree inputs and extends the tests to cover them.
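A rough illustration of the pytree handling, as an assumed flow rather than the exact implementation: nested containers of tensors are flattened into a flat list of leaves (here via `torch.utils._pytree.tree_flatten`), so the cache key and the fused call only ever see tensor leaves.
```py
import torch
from torch.utils._pytree import tree_flatten


def flatten_inputs(*args):
    # ({"x": t1}, [t2, t3]) -> ([t1, t2, t3], spec); the spec can rebuild the structure.
    flat_args, spec = tree_flatten(args)
    return flat_args, spec


leaves, spec = flatten_inputs({"x": torch.randn(2)}, [torch.randn(3), torch.randn(4)])
```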
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80525
Approved by: https://github.com/kevinstephano, https://github.com/jjsjann123, https://github.com/SherlockNoMad