Add torch._foreach_zero_ API (#47286)
Summary:
**In this PR**
- Add the `_foreach_zero_` API (a minimal usage sketch follows this list)
- Update all optimizers under `/_multi_tensor/` to use `_foreach_zero_` in their `zero_grad` method
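For context, here is a minimal sketch of how the new API is called: it zeroes every tensor in a list with a single fused dispatch instead of one kernel launch per tensor in a Python loop. Tensor shapes and count are illustrative, not from the benchmark below.

```
import torch

# A list of tensors (e.g. gradients) to clear in one call.
grads = [torch.rand(3, 200, 200) for _ in range(10)]

# Single fused call: zeroes every tensor in-place.
torch._foreach_zero_(grads)

assert all(not g.any() for g in grads)

# The per-tensor equivalent dispatches one op per tensor:
for g in grads:
    g.zero_()
```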
**Performance improvement** (100 CUDA tensors of shape `3x200x200`, roughly a 7x speedup):
```
----------------- OP: zero_ -----------------
for-loop: 630.36 us
foreach:   90.84 us
```
**Benchmark script**
```
import torch
import torch.utils.benchmark as benchmark_utils

inputs = [torch.rand(3, 200, 200, device="cuda") for _ in range(100)]

def main():
    for op in ["zero_"]:
        print("\n\n----------------- OP: ", op, " -----------------")

        # For-loop baseline: one kernel launch per tensor.
        stmt = "[torch.{op}(t) for t in inputs]"
        timer = benchmark_utils.Timer(
            stmt=stmt.format(op=op),
            globals=globals(),
            label="for-loop",
        )
        print(f"autorange:\n{timer.blocked_autorange()}\n\n")

        # Multi-tensor apply: a single fused _foreach_ call.
        stmt = "torch._foreach_{op}(inputs)"
        timer_mta = benchmark_utils.Timer(
            stmt=stmt.format(op=op),
            globals=globals(),
            label="foreach",
        )
        print(f"autorange:\n{timer_mta.blocked_autorange()}\n\n")

if __name__ == "__main__":
    main()
```
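For reference, a minimal sketch of the `zero_grad` pattern this enables in the multi-tensor optimizers: collect the parameter gradients that exist, then clear them with one `_foreach_zero_` call. The helper name and the flat parameter list are illustrative assumptions, not the exact implementation in `/_multi_tensor/`.

```
import torch

def zero_grad_foreach(params):
    # Gather gradients that actually exist; detaching mirrors the
    # per-tensor bookkeeping done by optimizer.zero_grad().
    grads = []
    for p in params:
        if p.grad is not None:
            p.grad.detach_()
            grads.append(p.grad)
    if grads:
        # One fused call instead of grad.zero_() per parameter.
        torch._foreach_zero_(grads)

# Illustrative usage:
params = [torch.rand(10, requires_grad=True) for _ in range(5)]
loss = sum((p * p).sum() for p in params)
loss.backward()
zero_grad_foreach(params)
assert all(not p.grad.any() for p in params)
```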
**TODO**
- Refactor `zero_grad` once the foreach APIs are stable.

**Tested** via unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47286
Reviewed By: ngimel
Differential Revision: D24706240
Pulled By: izdeby
fbshipit-source-id: aac69d6d134d65126ae8e5916f3627b73d8a94bf