Add caching of nvFuser fusion creation for torch._prims.executor (#80525)
In the current setup, a `Fusion` object is constructed from the `GraphModule` and args on every call to the `execute` function, which is expensive.
This PR uses `functools.lru_cache` so the `Fusion` creation cost is paid once per `GraphModule` and set of args. Currently the cache key treats the shape, strides, and dtype of the tensors as static; this can be relaxed later (keying only on ndim, contiguity, and dtype) to make better use of nvFuser's internal caching mechanism.
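A minimal, standalone sketch of the caching idea follows. It is an assumption about the flow rather than the exact implementation: `build_fusion` here is a hypothetical stand-in for the expensive nvFuser `Fusion` construction so the example runs on its own, and the cache key mirrors the static shape/stride/dtype scheme described above.
```py
from functools import lru_cache

import torch


def tensor_spec(t):
    # Hashable summary of a tensor: shape, strides, and dtype form the cache key,
    # matching the "static shape/stride/dtype" caching described above.
    return tuple(t.shape), tuple(t.stride()), t.dtype


@lru_cache(maxsize=None)
def build_fusion(gm, arg_specs):
    # Expensive step: in the real executor this would lower `gm` to an nvFuser
    # Fusion; the stand-in just returns a callable, built once per key.
    print("building fusion for", arg_specs)  # runs once per (gm, specs) pair
    return lambda *args: gm(*args)


def execute(gm, *args):
    specs = tuple(tensor_spec(a) for a in args if isinstance(a, torch.Tensor))
    fused = build_fusion(gm, specs)
    return fused(*args)
```
Repeated calls with tensors of the same shape, strides, and dtype hit the cache, so the construction cost is paid only on the first call, which is what produces the microsecond-level timing below.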
On master:
```py
In [2]: a = torch.randn(3, 3, device='cuda')
In [3]: with TorchRefsMode.push():
...: gm = make_fx(lambda x: torch.sigmoid(x))(a)
...:
In [4]: %%timeit
...: execute(gm, a, executor="nvfuser")
...: torch.cuda.synchronize()
175 ms ± 1.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
This PR:
```py
In [2]: a = torch.randn(3, 3, device='cuda')
In [3]: with TorchRefsMode.push():
...: gm = make_fx(lambda x: torch.sigmoid(x))(a)
...:
In [4]: %%timeit
...: execute(gm, a, executor="nvfuser")
...: torch.cuda.synchronize()
62.6 µs ± 9.99 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
```
In addition, this PR adds support for pytree inputs and extends the tests to cover them.
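A rough illustration of the pytree handling, as an assumed flow rather than the exact implementation: nested containers of tensors are flattened into a flat list of leaves (here via `torch.utils._pytree.tree_flatten`), so the cache key and the fused call only ever see tensor leaves.
```py
import torch
from torch.utils._pytree import tree_flatten


def flatten_inputs(*args):
    # ({"x": t1}, [t2, t3]) -> ([t1, t2, t3], spec); the spec can rebuild the structure.
    flat_args, spec = tree_flatten(args)
    return flat_args, spec


leaves, spec = flatten_inputs({"x": torch.randn(2)}, [torch.randn(3), torch.randn(4)])
```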
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80525
Approved by: https://github.com/kevinstephano, https://github.com/jjsjann123, https://github.com/SherlockNoMad