Skip dummy node creation for autograd engine when there is a single input and place on correct queue (#47592)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42890
- Removes the dummy root node when there is a single input
- Places the graph root task on the ready queue matching the input buffer's device, instead of defaulting to the CPU queue (a toy sketch of both changes follows below)

CPU - no significant change in wall-clock time (too noisy to measure), but up to a ~7% reduction in instruction count for small graphs
CUDA - a small reduction in wall-clock time (still very noisy) and up to a ~20% reduction in instruction count for small graphs
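For intuition only, here is a toy Python model of the scheduling change. It is not the engine code (the actual change lives in the C++ autograd engine), and every name in it (`ReadyQueue`, `schedule_root`, the task tuples) is a hypothetical stand-in rather than PyTorch API:
```
from collections import deque
from dataclasses import dataclass, field

@dataclass
class ReadyQueue:
    """Toy stand-in for the engine's per-device ready queues."""
    device: str
    tasks: deque = field(default_factory=deque)

def schedule_root(roots, root_grads, queues, *, skip_dummy=True, device_aware=True):
    """Schedule the initial backward task.

    skip_dummy:   with a single root, run it directly instead of wrapping it
                  in a dummy "graph root" node (change 1).
    device_aware: push the initial task onto the queue of the device the input
                  gradient buffer lives on, not always the CPU queue (change 2).
    """
    if skip_dummy and len(roots) == 1:
        task = ("run", roots[0], root_grads[0])
    else:
        task = ("run", ("graph_root", roots), root_grads)
    device = root_grads[0]["device"] if device_aware else "cpu"
    queues[device].tasks.append(task)

# Single CUDA root: no wrapper node, and the task lands on the CUDA queue.
queues = {"cpu": ReadyQueue("cpu"), "cuda:0": ReadyQueue("cuda:0")}
schedule_root(["MulBackward0"], [{"device": "cuda:0"}], queues)
assert not queues["cpu"].tasks
assert queues["cuda:0"].tasks[0] == ("run", "MulBackward0", {"device": "cuda:0"})
```
Setting `skip_dummy=False` and `device_aware=False` models the old behavior: a wrapper node pushed onto the CPU queue even when the input gradient lives on a CUDA device, which is the overhead the benchmarks below try to measure.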
**CPU**
Code:
```
import torch
from torch.utils.benchmark import Timer

setup = """
a = torch.rand((2, 2), requires_grad=True)
b = torch.rand((2, 2), requires_grad=True)
gradient = torch.ones(2, 2)
"""
stmt = """
torch.autograd.grad(a*b, [a, b], gradient)
"""
timer = Timer(stmt, setup)
print(timer.timeit(10000))           # wall-clock time
print(timer.collect_callgrind(100))  # instruction counts (requires valgrind)
```
Before (when dummy node is not skipped):
```
torch.autograd.grad(a*b, [a, b], gradient)
setup:
a = torch.rand((2, 2), requires_grad=True)
b = torch.rand((2, 2), requires_grad=True)
gradient = torch.ones(2, 2)
26.62 us
1 measurement, 10000 runs , 1 thread
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7efee44ad8e0>
torch.autograd.grad(a*b, [a, b], gradient)
setup:
a = torch.rand((2, 2), requires_grad=True)
b = torch.rand((2, 2), requires_grad=True)
gradient = torch.ones(2, 2)
All Noisy symbols removed
Instructions: 9755488 9659378
Baseline: 4300 3784
100 runs per measurement, 1 thread
```
After (dummy node skipped):
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7f56961a7730>
torch.autograd.grad(a*b, [a, b], gradient)
setup:
a = torch.rand((2, 2), requires_grad=True)
b = torch.rand((2, 2), requires_grad=True)
gradient = torch.ones(2, 2)
26.78 us
1 measurement, 10000 runs , 1 thread
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f56961a78e0>
torch.autograd.grad(a*b, [a, b], gradient)
setup:
a = torch.rand((2, 2), requires_grad=True)
b = torch.rand((2, 2), requires_grad=True)
gradient = torch.ones(2, 2)
All Noisy symbols removed
Instructions: 9045508 8939872
Baseline: 4280 3784
100 runs per measurement, 1 thread
```
**CUDA**
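Code (no script was pasted for the CUDA runs; this is a reconstruction from the setup/stmt echoed in the measurements below, assuming the same `Timer` invocation as the CPU script):
```
import torch
from torch.utils.benchmark import Timer

setup = """
x = torch.rand((2,2), requires_grad=True, device="cuda")
y = torch.rand((2,2), requires_grad=True, device="cuda")
out = x + y
gradient = torch.ones(2, 2).cuda()
"""
stmt = """
torch.autograd.grad(out, [x, y], gradient)
"""
timer = Timer(stmt, setup)
print(timer.timeit(10000))
print(timer.collect_callgrind(100))
```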
Before:
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7f84cbaa1ee0>
torch.autograd.grad(out, [x, y], gradient)
setup:
x = torch.rand((2,2), requires_grad=True, device="cuda")
y = torch.rand((2,2), requires_grad=True, device="cuda")
out = x + y
gradient = torch.ones(2, 2).cuda()
70.49 us
1 measurement, 10000 runs , 1 thread
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f84cbaa1e50>
torch.autograd.grad(out, [x, y], gradient)
setup:
x = torch.rand((2,2), requires_grad=True, device="cuda")
y = torch.rand((2,2), requires_grad=True, device="cuda")
out = x + y
gradient = torch.ones(2, 2).cuda()
All Noisy symbols removed
Instructions: 5054581 4951911
Baseline: 4105 3735
100 runs per measurement, 1 thread
```
Remove dummy node only:
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7fbf29c67eb0>
torch.autograd.grad(out, [x, y], gradient)
setup:
x = torch.rand((2,2), requires_grad=True, device="cuda")
y = torch.rand((2,2), requires_grad=True, device="cuda")
out = x + y
gradient = torch.ones(2, 2).cuda()
55.65 us
1 measurement, 10000 runs , 1 thread
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7fbf29c67e20>
torch.autograd.grad(out, [x, y], gradient)
setup:
x = torch.rand((2,2), requires_grad=True, device="cuda")
y = torch.rand((2,2), requires_grad=True, device="cuda")
out = x + y
gradient = torch.ones(2, 2).cuda()
All Noisy symbols removed
Instructions: 5002105 4900841
Baseline: 4177 3731
100 runs per measurement, 1 thread
```
Remove dummy node and place on the correct queue:
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7fb64438ce80>
torch.autograd.grad(out, [x, y], gradient)
setup:
x = torch.rand((2,2), requires_grad=True, device="cuda")
y = torch.rand((2,2), requires_grad=True, device="cuda")
out = x + y
gradient = torch.ones(2, 2).cuda()
27.56 us
1 measurement, 10000 runs , 1 thread
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7fb64438cdf0>
torch.autograd.grad(out, [x, y], gradient)
setup:
x = torch.rand((2,2), requires_grad=True, device="cuda")
y = torch.rand((2,2), requires_grad=True, device="cuda")
out = x + y
gradient = torch.ones(2, 2).cuda()
All Noisy symbols removed
Instructions: 4104433 4007555
Baseline: 4159 3735
100 runs per measurement, 1 thread
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47592
Reviewed By: ailzhang
Differential Revision: D24890761
Pulled By: soulitzer
fbshipit-source-id: f457376e4a882f8a59476e8c1e708391b1a031a2