use libdevice for tanh (#90889)
Per title
I see slight perf differences with this implementation: standalone tanh is slightly slower for a tensor of 4000000
elements (20.4 us instead of 19.4 us); other sizes are within noise.
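For scale, the reported gap works out to roughly a 5% slowdown at that size. A minimal sketch of the arithmetic, using only the two timings quoted above (the helper name `relative_slowdown` is hypothetical and not part of the PR):

```python
# Timings quoted in this PR for standalone tanh on 4,000,000 elements.
baseline_us = 19.4   # before this change
libdevice_us = 20.4  # with the libdevice-based tanh


def relative_slowdown(before_us: float, after_us: float) -> float:
    """Return the fractional slowdown of after_us relative to before_us."""
    return (after_us - before_us) / before_us


slowdown = relative_slowdown(baseline_us, libdevice_us)
print(f"{slowdown:.1%}")  # about 5%
```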
@bertmaher could you check if it affects your benchmarks?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90889
Approved by: https://github.com/bertmaher, https://github.com/anijain2305
Author
Natalia Gimelshein