Replacement is irrelevant for 1-sample multinomial (#86342)
So use the fast path on both CPU and MPS.
Also remove some spurious copy-and-paste checks from the MPS code path.
CUDA already has this optimization; see
https://github.com/pytorch/pytorch/blob/dc9c507d24d0c833cb09105177326f1f6bbe99c4/aten/src/ATen/native/cuda/MultinomialKernel.cu#L355-L356
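
Why replacement does not matter here: replacement only changes what later draws can return, so with a single draw both modes sample index i with probability p_i / sum(p). A minimal pure-Python sketch of a one-draw sampler based on an exponential race follows. It is an illustration only, not the ATen code, and `sample_one` is a hypothetical name.

```python
import random


def sample_one(probs, rng):
    """Draw one index i with probability probs[i] / sum(probs).

    Exponential race: give each category an arrival time e_i / p_i with
    e_i ~ Exp(1) and return the earliest arrival. This is equivalent to
    taking argmax(p_i / e_i), but avoids dividing by a zero draw.
    No "replacement" flag is needed because only one index is returned.
    """
    best_index = -1
    best_time = float("inf")
    for i, p in enumerate(probs):
        if p <= 0.0:
            continue  # zero-weight categories can never be selected
        arrival = rng.expovariate(1.0) / p
        if arrival < best_time:
            best_index, best_time = i, arrival
    return best_index


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [0.1, 0.0, 0.2, 0.7]  # need not be normalized
    draws = 20000
    counts = [0] * len(weights)
    for _ in range(draws):
        counts[sample_one(weights, rng)] += 1
    # Empirical frequencies should be close to the normalized weights.
    print([c / draws for c in counts])
```

Returning the k earliest arrivals instead of only the first would give k draws without replacement, which is why a single routine can plausibly serve both cases.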
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86342
Approved by: https://github.com/ngimel