Vectorize tensor lerp kernel (#84845)
Fixes #86964
In a simple timeit benchmark, I see a 1.7x speedup for complex64 (from 6.7 us
to 3.9 us) and a 3.2x speedup for float32 (from 6.2 us to 1.9 us).
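
For reference, a minimal sketch of how such a timeit measurement could be set up. This is not the exact script behind the numbers above: the tensor size, the iteration counts, and the helper names (`lerp_reference`, `time_us`, `bench_torch_lerp`) are assumptions made for illustration.

```python
import timeit


def lerp_reference(start, end, weight):
    """Scalar linear interpolation: start + weight * (end - start)."""
    return start + weight * (end - start)


def time_us(fn, number=10_000, repeat=5):
    """Return the best-of-`repeat` mean time per call, in microseconds."""
    best_total = min(timeit.repeat(fn, number=number, repeat=repeat))
    return best_total / number * 1e6


def bench_torch_lerp(dtype_name, size=1024):
    """Time torch.lerp with a tensor weight.

    `size` is an assumed value; the original benchmark's tensor size is not
    stated in this message.
    """
    import torch  # imported lazily so the helpers above work without PyTorch

    dtype = getattr(torch, dtype_name)
    start = torch.randn(size, dtype=dtype)
    end = torch.randn(size, dtype=dtype)
    # Pass the weight as a tensor (rather than a Python scalar) so the
    # tensor-weight variant of lerp is the one being measured.
    weight = torch.rand(size).to(dtype)
    return time_us(lambda: torch.lerp(start, end, weight))
```

Running `bench_torch_lerp("float32")` and `bench_torch_lerp("complex64")` on builds before and after this change gives a per-call time in us for each dtype; absolute numbers will depend on the machine and the chosen size.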
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84845
Approved by: https://github.com/lezcano, https://github.com/malfet