[MPS] Add roll op (#95168)
This reuses the CPU implementation, since the MPS API currently has no native roll implementation (if there is one, please let me know).
Compared with falling back to the CPU via `PYTORCH_ENABLE_MPS_FALLBACK=1`, this approach keeps tensors on MPS.
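To illustrate why a roll does not need a dedicated native kernel, here is a minimal sketch of a single-dimension roll composed only of `narrow` and `cat`, both of which run on whatever device the input tensor lives on. The helper name `roll_1d` is hypothetical and the code is not taken from this PR's diff; it only shows the general idea under the assumption of one shift along one dimension.

```python
import torch


def roll_1d(x, shift, dim):
    """Hypothetical helper: roll `x` by `shift` along `dim` using narrow + cat."""
    size = x.size(dim)
    if size == 0:
        # Nothing to roll; also avoids a modulo-by-zero below.
        return x.clone()
    # Index of the element that ends up first after the roll.
    # Python's % already yields a non-negative result for a positive size.
    start = (size - shift) % size
    tail = x.narrow(dim, start, size - start)  # elements that wrap to the front
    head = x.narrow(dim, 0, start)             # elements pushed toward the end
    return torch.cat([tail, head], dim)


x = torch.arange(12).view(3, 4)
rolled = roll_1d(x, 1, 1)
# Compare against the built-in op as a sanity check.
print(torch.equal(rolled, torch.roll(x, 1, 1)))
```

Because the sketch only slices and concatenates, the result stays on the same device as `x`, which is the property this change is after.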
A small benchmark:
```python
import time

import torch

for num in [10, 100, 1000, 10000]:
    for shft in [1, 5]:
        sz = num * num

        # Time torch.roll on the CPU.
        x = torch.arange(sz, device="cpu").view(num, num)
        s = time.time()
        r = torch.roll(x, shft)
        cpu_e = time.time() - s

        # Time the same call on MPS.
        x = torch.arange(sz, device="mps").view(num, num)
        s = time.time()
        r = torch.roll(x, shft)
        mps_e = time.time() - s

        print(f"size: ({num}, {num}) shft: {shft} cpu: {cpu_e} mps: {mps_e}")
```
```
size: (10, 10) shft: 1 cpu: 0.00015163421630859375 mps: 0.003078937530517578
size: (10, 10) shft: 5 cpu: 6.794929504394531e-05 mps: 0.0014979839324951172
size: (100, 100) shft: 1 cpu: 0.0001621246337890625 mps: 0.0016200542449951172
size: (100, 100) shft: 5 cpu: 0.00016379356384277344 mps: 0.00154876708984375
size: (1000, 1000) shft: 1 cpu: 0.0022068023681640625 mps: 0.0017690658569335938
size: (1000, 1000) shft: 5 cpu: 0.009071111679077148 mps: 0.0020020008087158203
size: (10000, 10000) shft: 1 cpu: 0.16785407066345215 mps: 0.011695146560668945
size: (10000, 10000) shft: 5 cpu: 0.1160881519317627 mps: 0.011452913284301758
```
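The numbers above come from a single, unsynchronized call per configuration, so they may include one-time setup cost and may not capture all of the asynchronous MPS work. As a hedged sketch of a stricter measurement (not what was run for this PR), the helper below warms up once, averages several calls, and calls `torch.mps.synchronize()` before reading the clock. The name `time_roll` and the `repeats` parameter are hypothetical, and `torch.mps.synchronize()` is assumed to be available in the installed PyTorch version.

```python
import time

import torch


def time_roll(x, shift, repeats=10):
    """Hypothetical helper: mean seconds per torch.roll call on x's device."""

    def sync():
        # Wait for queued MPS work to finish; a no-op on other devices.
        if x.device.type == "mps":
            torch.mps.synchronize()

    torch.roll(x, shift)  # warm-up call, excluded from the timing
    sync()
    start = time.perf_counter()
    for _ in range(repeats):
        torch.roll(x, shift)
    sync()
    return (time.perf_counter() - start) / repeats


# Use MPS when present so the sketch also runs on machines without it.
device = "mps" if torch.backends.mps.is_available() else "cpu"
x = torch.arange(100 * 100, device=device).view(100, 100)
elapsed = time_roll(x, 1)
print(f"device: {device} mean seconds per roll: {elapsed}")
```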
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95168
Approved by: https://github.com/albanD