Fix `_foreach_norm` on some tensor sizes (#91844)
This PR fixes two bugs in CUDA `_foreach_norm`:
1. The norm is wrong when a tensor's numel exceeds `kChunkSize = 65536`:
```
>>> torch._foreach_norm([torch.ones(60000, device="cuda") for _ in range(1)])
(tensor(244.9490, device='cuda:0', grad_fn=<NotImplemented>),)
>>> torch._foreach_norm([torch.ones(70000, device="cuda") for _ in range(1)])
(tensor(256., device='cuda:0', grad_fn=<NotImplemented>),)
>>> torch.ones(60000, device="cuda").norm()
tensor(244.9490, device='cuda:0', grad_fn=<LinalgVectorNormBackward0>)
>>> torch.ones(70000, device="cuda").norm()
tensor(264.5751, device='cuda:0', grad_fn=<LinalgVectorNormBackward0>)
```
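For an all-ones input the wrong value is telling: 256 = sqrt(65536) = sqrt(`kChunkSize`), i.e. the result looks as if only a single chunk's worth of elements were reduced. Below is a toy Python model of that symptom, not the actual CUDA kernel; `chunked_norm` is a made-up helper:
```
import math

kChunkSize = 65536

def chunked_norm(numel, buggy=False):
    # Per-chunk partial sums of squares for an all-ones tensor of `numel` elements.
    partials = [min(kChunkSize, numel - i) for i in range(0, numel, kChunkSize)]
    # The buggy variant keeps only the first chunk's partial sum.
    total = partials[0] if buggy else sum(partials)
    return math.sqrt(total)

print(chunked_norm(70000, buggy=True))   # 256.0, matches the bad output above
print(chunked_norm(70000, buggy=False))  # 264.575..., matches torch.norm
```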
2. An `IndexError` is raised when a tensor's numel is smaller than the number of tensors in the list:
```
>>> torch._foreach_norm([torch.ones(9, device="cuda") for _ in range(10)])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: select(): index 9 out of range for tensor of size [9] at dimension 0
```
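Here is a hedged sketch of the indexing pattern that produces this exact message; the real code is a CUDA/C++ kernel, and the wrongly sized `buf` below is illustrative only. If the i-th per-tensor result is gathered with `select()` from a buffer whose leading dimension ends up being the tensors' numel (9) instead of the number of tensors (10), the last select runs out of range:
```
import torch

num_tensors, numel = 10, 9
# Wrongly sized result buffer: should have one slot per tensor, i.e. size 10.
buf = torch.zeros(numel)
try:
    results = [buf.select(0, i) for i in range(num_tensors)]
except IndexError as e:
    print(e)  # select(): index 9 out of range for tensor of size [9] at dimension 0
```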
The first bug could have been caught by the test suite if `PYTORCH_TEST_WITH_SLOW` were set to 1, because that configuration tests tensors of size 300*300 = 90000, which exceeds `kChunkSize`. It's not enabled by default; does anyone know if it's ever enabled?
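As a sanity check (assuming a CUDA build; on a checkout without this fix), the slow configuration's 300x300 inputs are large enough to hit the first bug:
```
import torch

t = torch.ones(300, 300, device="cuda")
print(t.numel())  # 90000, larger than kChunkSize = 65536
# On builds without this fix these disagree; with it, both print 300.0.
print(torch._foreach_norm([t])[0].item(), t.norm().item())
```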
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91844
Approved by: https://github.com/ngimel