Optimize relu on cpu using clamp_min (#50924)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50924
`clamp_min` seems slightly faster than `threshold` (on AVX2 CPUs)
because it compiles down to `vmaxps` rather than `vcmpps` + `vblendv`.
I see the biggest perf difference (about 20% faster) with float
tensors of 32k-64k elements. Bigger tensors are more memory-bound,
though it still looks like a small win (~2%).
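The two formulations are mathematically equivalent for ReLU. A minimal pure-Python sketch of the idea (not the ATen kernel itself; the actual speedup comes from the vectorized instructions each form lowers to):

```python
def relu_threshold(x, threshold=0.0, value=0.0):
    # threshold-style ReLU: compare, then select the original element
    # or the replacement value (lowers to vcmpps + vblendv on AVX2)
    return [xi if xi > threshold else value for xi in x]

def relu_clamp_min(x, min_val=0.0):
    # clamp_min-style ReLU: a single elementwise max against min_val
    # (lowers to a single vmaxps on AVX2)
    return [max(xi, min_val) for xi in x]

data = [-2.0, -0.5, 0.0, 0.5, 3.0]
assert relu_threshold(data) == relu_clamp_min(data)  # same result either way
```

With the default threshold/value of 0.0, both produce identical outputs, so swapping the kernel is behavior-preserving.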
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D26009829
Pulled By: bertmaher
fbshipit-source-id: 7bb1583ffb3ee242e347f59be82e0712c7631f7e