Fuse ops in eager cosine_similarity while keeping the stability and the gradients (#104771)
https://github.com/pytorch/pytorch/pull/31378 introduced a regression,
which was reported in https://github.com/pytorch/pytorch/issues/104564.
This PR should keep the efficiency and memory usage of the original
implementation, while keeping the numerical stability of the implementation
from https://github.com/pytorch/pytorch/pull/31378.
This solution was already discussed in https://github.com/pytorch/pytorch/pull/31378,
but it was discarded because it modified the output of `vector_norm` in-place.
The only ingredient missing for that solution to work was a `clone()` after
the call to `vector_norm`.
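
For illustration, here is a minimal Python sketch of the idea. It is not the
actual ATen code changed by this PR. The function name
`cosine_similarity_sketch` and its defaults are hypothetical, and the
normalize-then-multiply structure is an assumption about how the stability is
kept. The point it shows is the `clone()`: the in-place clamp then writes to
a copy rather than to the tensor returned by `vector_norm`, which autograd
may need unchanged for the backward pass.

```python
import torch


def cosine_similarity_sketch(x1, x2, dim=1, eps=1e-8):
    # Hypothetical helper, not the PyTorch implementation.
    # clone() the norms so the in-place clamp below does not touch the
    # tensors produced by vector_norm.
    x1_norm = torch.linalg.vector_norm(x1, 2, dim=dim, keepdim=True).clone()
    x2_norm = torch.linalg.vector_norm(x2, 2, dim=dim, keepdim=True).clone()

    # Clamp in-place to avoid allocating another tensor, then normalize each
    # input before multiplying so the intermediate values stay bounded.
    x1_n = x1 / x1_norm.clamp_min_(eps)
    x2_n = x2 / x2_norm.clamp_min_(eps)
    return (x1_n * x2_n).sum(dim)


if __name__ == "__main__":
    a = torch.randn(4, 8, dtype=torch.float64, requires_grad=True)
    b = torch.randn(4, 8, dtype=torch.float64, requires_grad=True)
    out = cosine_similarity_sketch(a, b)
    out.sum().backward()  # gradients reach both a and b
    print(out.shape, a.grad.shape, b.grad.shape)
```

The sketch can be compared against `torch.nn.functional.cosine_similarity`
on the same inputs to check that the values agree.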
I hope this PR takes less time to land than https://github.com/pytorch/pytorch/issues/104564.
Fixes https://github.com/pytorch/pytorch/issues/104564
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104771
Approved by: https://github.com/albanD