pytorch
d20483a9 - Skip dummy node creation for autograd engine when there is a single input and place on correct queue (#47592)

Skip dummy node creation for autograd engine when there is a single input and place on correct queue (#47592)

Summary:
Fixes https://github.com/pytorch/pytorch/issues/42890

- Removes the dummy node
- Places the graph root on the correct ready queue based on the input buffer's device, instead of defaulting to the CPU queue

cpu - no significant change in speed (too noisy to measure), but up to a 7% reduction in instruction count for small graphs
cuda - small reduction in speed (still very noisy) and up to a ~20% reduction in instruction count for small graphs

**CPU**

Code:
```
import torch
from torch.utils.benchmark import Timer

setup="""
a = torch.rand((2, 2), requires_grad=True)
b = torch.rand((2, 2), requires_grad=True)
gradient = torch.ones(2, 2)
"""

stmt="""
torch.autograd.grad(a*b, [a, b], gradient)
"""

timer = Timer(stmt, setup)

print(timer.timeit(10000))
print(timer.collect_callgrind(100))
```

Before (when the dummy node is not skipped):
```
torch.autograd.grad(a*b, [a, b], gradient)
setup:
  a = torch.rand((2, 2), requires_grad=True)
  b = torch.rand((2, 2), requires_grad=True)
  gradient = torch.ones(2, 2)

  26.62 us
  1 measurement, 10000 runs , 1 thread
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7efee44ad8e0>
torch.autograd.grad(a*b, [a, b], gradient)
setup:
  a = torch.rand((2, 2), requires_grad=True)
  b = torch.rand((2, 2), requires_grad=True)
  gradient = torch.ones(2, 2)
                     All    Noisy symbols removed
  Instructions:  9755488    9659378
  Baseline:         4300    3784
  100 runs per measurement, 1 thread
```

After:
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7f56961a7730>
torch.autograd.grad(a*b, [a, b], gradient)
setup:
  a = torch.rand((2, 2), requires_grad=True)
  b = torch.rand((2, 2), requires_grad=True)
  gradient = torch.ones(2, 2)

  26.78 us
  1 measurement, 10000 runs , 1 thread
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f56961a78e0>
torch.autograd.grad(a*b, [a, b], gradient)
setup:
  a = torch.rand((2, 2), requires_grad=True)
  b = torch.rand((2, 2), requires_grad=True)
  gradient = torch.ones(2, 2)
                     All    Noisy symbols removed
  Instructions:  9045508    8939872
  Baseline:         4280    3784
  100 runs per measurement, 1 thread
```

**Cuda**

Before:
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7f84cbaa1ee0>
torch.autograd.grad(out, [x, y], gradient)
setup:
  x = torch.rand((2,2), requires_grad=True, device="cuda")
  y = torch.rand((2,2), requires_grad=True, device="cuda")
  out = x + y
  gradient = torch.ones(2, 2).cuda()

  70.49 us
  1 measurement, 10000 runs , 1 thread
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f84cbaa1e50>
torch.autograd.grad(out, [x, y], gradient)
setup:
  x = torch.rand((2,2), requires_grad=True, device="cuda")
  y = torch.rand((2,2), requires_grad=True, device="cuda")
  out = x + y
  gradient = torch.ones(2, 2).cuda()
                     All    Noisy symbols removed
  Instructions:  5054581    4951911
  Baseline:         4105    3735
  100 runs per measurement, 1 thread
```

Remove dummy node only:
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7fbf29c67eb0>
torch.autograd.grad(out, [x, y], gradient)
setup:
  x = torch.rand((2,2), requires_grad=True, device="cuda")
  y = torch.rand((2,2), requires_grad=True, device="cuda")
  out = x + y
  gradient = torch.ones(2, 2).cuda()

  55.65 us
  1 measurement, 10000 runs , 1 thread
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7fbf29c67e20>
torch.autograd.grad(out, [x, y], gradient)
setup:
  x = torch.rand((2,2), requires_grad=True, device="cuda")
  y = torch.rand((2,2), requires_grad=True, device="cuda")
  out = x + y
  gradient = torch.ones(2, 2).cuda()
                     All    Noisy symbols removed
  Instructions:  5002105    4900841
  Baseline:         4177    3731
  100 runs per measurement, 1 thread
```

Remove dummy node and put in correct queue:
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7fb64438ce80>
torch.autograd.grad(out, [x, y], gradient)
setup:
  x = torch.rand((2,2), requires_grad=True, device="cuda")
  y = torch.rand((2,2), requires_grad=True, device="cuda")
  out = x + y
  gradient = torch.ones(2, 2).cuda()

  27.56 us
  1 measurement, 10000 runs , 1 thread
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7fb64438cdf0>
torch.autograd.grad(out, [x, y], gradient)
setup:
  x = torch.rand((2,2), requires_grad=True, device="cuda")
  y = torch.rand((2,2), requires_grad=True, device="cuda")
  out = x + y
  gradient = torch.ones(2, 2).cuda()
                     All    Noisy symbols removed
  Instructions:  4104433    4007555
  Baseline:         4159    3735
  100 runs per measurement, 1 thread
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47592
Reviewed By: ailzhang
Differential Revision: D24890761
Pulled By: soulitzer
fbshipit-source-id: f457376e4a882f8a59476e8c1e708391b1a031a2
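The change itself lives in the C++ autograd engine, but the two ideas can be sketched in plain Python. The names below (`Node`, `GraphRoot`, `build_graph_task_root`, `pick_ready_queue`) are illustrative stand-ins, not the real engine API:

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Stand-in for an autograd function node."""
    name: str


@dataclass
class GraphRoot(Node):
    """Stand-in for the dummy root node that fans out to several outputs."""
    roots: list = field(default_factory=list)


def build_graph_task_root(roots):
    # Previously the engine always wrapped the outputs in a dummy
    # GraphRoot node. With a single output there is nothing to fan out
    # to, so the wrapper can be skipped and backward starts directly
    # from that node.
    if len(roots) == 1:
        return roots[0]
    return GraphRoot("graph_root", roots=roots)


def pick_ready_queue(input_buffer_device):
    # Second half of the change: enqueue the root task on the queue that
    # matches the input buffer's device rather than defaulting to the
    # CPU queue.
    return f"ready_queue[{input_buffer_device}]"
```

Under this sketch, a single-root call like `torch.autograd.grad(a*b, ...)` starts from the `a*b` node itself, and a CUDA input buffer routes the task to the CUDA ready queue instead of bouncing through the CPU one.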
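As a sanity check on the headline numbers, the reductions can be recomputed from the "All" Callgrind instruction counts quoted above (before vs. the fully patched run):

```python
# Callgrind "All" instruction counts from the measurements above.
cpu_before, cpu_after = 9755488, 9045508    # CPU: dummy node kept vs. skipped
cuda_before, cuda_after = 5054581, 4104433  # CUDA: baseline vs. both changes

cpu_reduction = 100 * (cpu_before - cpu_after) / cpu_before
cuda_reduction = 100 * (cuda_before - cuda_after) / cuda_before

print(f"CPU:  {cpu_reduction:.1f}% fewer instructions")   # 7.3
print(f"CUDA: {cuda_reduction:.1f}% fewer instructions")  # 18.8
```

This matches the "up to 7%" and "~20%" figures in the summary, noise aside.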