[te] Speed up relu on CPU
Summary:
relu was implemented using ifThenElse, which creates conditional
branches that complicate LLVM's vectorization. Using CompareSelect directly
yields clean vectorized code consisting of nothing but vmovups and vmaxps.
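For illustration only, here is a plain C++ sketch of the two shapes of
code described above. It is not the tensorexpr implementation: the
function names relu_branch and relu_select are hypothetical, and a modern
compiler may well vectorize both loops on its own. It only shows why a
single compare-and-select expression maps more directly onto a vector max
than control flow does.

```cpp
#include <cstddef>
#include <vector>

// Branchy form: control flow decides which store runs, analogous in
// spirit to an ifThenElse-style lowering (an assumption for illustration).
void relu_branch(const float* in, float* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    if (in[i] < 0.0f) {
      out[i] = 0.0f;
    } else {
      out[i] = in[i];
    }
  }
}

// Select form: one compare feeding one select, with a single
// unconditional store per element. This is the shape a
// compare-select expression has.
void relu_select(const float* in, float* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    const float x = in[i];
    out[i] = (x < 0.0f) ? 0.0f : x;
  }
}
```

Both functions compute the same result; the difference is only in how
the choice between 0 and x is expressed to the compiler.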
Test Plan: A trivial benchmark shows a 33% speedup on large tensors (256k elements).
Reviewed By: eellison
Differential Revision: D25986637
fbshipit-source-id: 72dd7776924f73c036d46dca30dff22404d86b82