Optimize SiLU (Swish) op in PyTorch (#42976)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42976
Optimize SiLU (Swish) op in PyTorch.
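For reference, SiLU is defined as silu(x) = x * sigmoid(x). A minimal sketch of the reference math and its analytic gradient (this is just the formula the op implements, not the optimized kernel added here):

import torch
import torch.nn.functional as F

def silu_reference(x):
    # SiLU / Swish: x * sigmoid(x)
    return x * torch.sigmoid(x)

def silu_grad_reference(x):
    # d/dx [x * sigmoid(x)] = sigmoid(x) * (1 + x * (1 - sigmoid(x)))
    s = torch.sigmoid(x)
    return s * (1 + x * (1 - s))

x = torch.randn(8, requires_grad=True)
y = F.silu(x)  # the op optimized by this PR
y.sum().backward()
# Autograd should match the analytic gradient above.
assert torch.allclose(x.grad, silu_grad_reference(x.detach()), atol=1e-6)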
Some benchmark results (before -> after):
input = torch.rand(1024, 32768, dtype=torch.float, device="cpu")
  forward:  221ms -> 133ms
  backward: 600ms -> 170ms
input = torch.rand(1024, 32768, dtype=torch.double, device="cpu")
  forward:  479ms -> 297ms
  backward: 1438ms -> 387ms
input = torch.rand(8192, 32768, dtype=torch.float, device="cuda")
  forward:  24.34ms -> 9.83ms
  backward: 97.05ms -> 29.03ms
input = torch.rand(4096, 32768, dtype=torch.double, device="cuda")
  forward:  44.24ms -> 30.15ms
  backward: 126.21ms -> 49.68ms
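The exact harness is not part of this commit; a minimal sketch of how such numbers can be measured, assuming plain wall-clock timing with CUDA synchronization (the bench_ms helper and iteration counts are hypothetical):

import time
import torch
import torch.nn.functional as F

def bench_ms(fn, iters=10):
    # Warm up, then report average wall-clock time per call in ms.
    for _ in range(3):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.time() - start) / iters * 1e3

x = torch.rand(1024, 32768, dtype=torch.float, device="cpu", requires_grad=True)
y = F.silu(x)
grad = torch.ones_like(y)

print("forward:  %.2f ms" % bench_ms(lambda: F.silu(x)))
print("backward: %.2f ms" % bench_ms(lambda: y.backward(grad, retain_graph=True)))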
Test Plan: buck test mode/dev-nosan //caffe2/test:nn -- "SiLU"
Reviewed By: houseroad
Differential Revision: D23093593
fbshipit-source-id: 1ba7b95d5926c4527216ed211a5ff1cefa3d3bfd