Vectorize norm(double, p=2) on cpu (#91502)
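This gives a speedup of about 100x for `torch.linalg.norm` and `torch.linalg.vector_norm` on a (200000, 3) input with 32 threads on my machine (roughly 10000 us down to about 100 us):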
```
[------------------ Master -------------------]
                                |  (200000, 3)
32 threads: ----------------------------------
      torch linalg_norm         |     10000
      torch linalg_vector_norm  |     10000
      torch custom              |       397
      numpy norm                |      3123
      numpy custom_np           |      3119

Times are in microseconds (us).

[------------------- PR -------------------]
                                |  (200000, 3)
32 threads: ----------------------------------
      torch linalg_norm         |       107
      torch linalg_vector_norm  |       100
      torch custom              |       400
      numpy norm                |      3170
      numpy custom_np           |      3162

Times are in microseconds (us).
```
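For anyone who wants to reproduce numbers of this kind, here is a minimal sketch of a comparable benchmark using `torch.utils.benchmark`. It is not the script used for the table above. The reduction over the last dimension, the definition of `custom_norm`, and the helper name `run_benchmark` are all assumptions made for illustration, and the numpy rows are left out.

```python
import torch
import torch.utils.benchmark as benchmark


def custom_norm(x):
    # Hand-written 2-norm along the last dimension.
    # Assumption: this is only a guess at what the "custom" row measures.
    return torch.sqrt((x * x).sum(dim=-1))


def run_benchmark(shape=(200000, 3), num_threads=32, min_run_time=0.2):
    # Double-precision input, matching the norm(double, p=2) case in the title.
    x = torch.randn(shape, dtype=torch.float64)
    # Assumption: the norm is taken over the last dimension.
    stmts = {
        "linalg_norm": "torch.linalg.norm(x, dim=-1)",
        "linalg_vector_norm": "torch.linalg.vector_norm(x, ord=2, dim=-1)",
        "custom": "custom_norm(x)",
    }
    results = []
    for name, stmt in stmts.items():
        timer = benchmark.Timer(
            stmt=stmt,
            globals={"torch": torch, "x": x, "custom_norm": custom_norm},
            num_threads=num_threads,
            label="norm(double, p=2)",
            sub_label=name,
            description=str(tuple(shape)),
        )
        # blocked_autorange repeats the statement until min_run_time is reached.
        results.append(timer.blocked_autorange(min_run_time=min_run_time))
    return results


if __name__ == "__main__":
    # Prints a table in the same style as the one above.
    benchmark.Compare(run_benchmark()).print()
```

Running the script once on a build before this change and once on a build with it should give two tables that can be compared row by row. Absolute timings depend heavily on the machine and thread count.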
Fixes https://github.com/pytorch/pytorch/issues/91373
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91502
Approved by: https://github.com/mingfeima, https://github.com/ngimel