Add fastpath for dot and vdot when the inputs have conj bit set to True (#62915)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62915
Performance improves by as much as 45% on CUDA and 20% on CPU.
The improvement is consistent across all cases; see the perf numbers in the comments below.
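For context, the fastpath presumably relies on the standard identities relating the two ops: torch.vdot conjugates its first argument, so a dot whose input carries the conj bit can be routed to vdot (and vice versa) without first materializing the conjugate. The snippet below is a minimal sketch that only checks these identities from Python; it does not reproduce the dispatch logic in this PR, and the tensor names and sizes are illustrative.

```python
import torch

torch.manual_seed(0)
x = torch.randn(8, dtype=torch.complex64)
y = torch.randn(8, dtype=torch.complex64)

# x.conj() is lazy for complex tensors: it returns a view with the conj bit set.
xc = x.conj()
yc = y.conj()
print(xc.is_conj(), yc.is_conj())

# dot(conj(x), y) = sum(conj(x_i) * y_i) = vdot(x, y)
dot_conj_first = torch.dot(xc, y)
# dot(x, conj(y)) = sum(x_i * conj(y_i)) = vdot(y, x)
dot_conj_second = torch.dot(x, yc)
# vdot(conj(x), y) = sum(x_i * y_i) = dot(x, y)
vdot_conj_first = torch.vdot(xc, y)

assert torch.allclose(dot_conj_first, torch.vdot(x, y))
assert torch.allclose(dot_conj_second, torch.vdot(y, x))
assert torch.allclose(vdot_conj_first, torch.dot(x, y))
```

Each assertion compares the lazily conjugated call against the equivalent call on the unconjugated tensors, which is the equivalence a fastpath of this kind would depend on.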
Test Plan: Imported from OSS
Reviewed By: heitorschueroff
Differential Revision: D30404006
Pulled By: anjali411
fbshipit-source-id: 565940da28c7761d993cf43346932c24292e8a4d