Increase numel limit to 2^63 for replicatepad1d (#122199)
Summary: As the title says, raise the numel (element count) limit for replicatepad1d to 2^63.
Test Plan:
```
CUDA_VISIBLE_DEVICES=5 buck2 test mode/opt //caffe2/test:test_nn_cuda -- test_replicatepad_64bit_indexing
```
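For context, here is a minimal sketch of the size arithmetic behind a 64-bit indexing test. It is pure Python, not the actual test: the helper names and the example shape are hypothetical, and it assumes that "numel" means the product of a tensor's dimensions and that the 64-bit case is any tensor whose numel no longer fits in a signed 32-bit index.

```python
import math

# Largest values representable by signed 32-bit and 64-bit indices.
INT32_MAX = 2**31 - 1
INT64_MAX = 2**63 - 1


def numel(shape):
    """Total element count of a tensor with the given shape."""
    return math.prod(shape)


def needs_64bit_indexing(shape):
    """True when a flat index into the tensor can overflow int32."""
    return numel(shape) > INT32_MAX


def padded_shape_1d(shape, pad_left, pad_right):
    """Output shape of 1d padding: only the last dimension grows."""
    *lead, width = shape
    return (*lead, width + pad_left + pad_right)


# Hypothetical (N, C, W) input, chosen so numel is exactly 2^31.
shape = (2, 4, 2**28)
out_shape = padded_shape_1d(shape, pad_left=1, pad_right=1)

print(numel(shape), needs_64bit_indexing(shape))
print(numel(out_shape), numel(out_shape) <= INT64_MAX)
```

A shape like this is just past the int32 range, so both the input and the padded output need 64-bit offsets.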
Also benchmarked in N5106027:
```
device_ms, cpu_ms, gb/device_ms*1000
# before changes
11.058772478103638 18.912256770000006 735.4118906278957
# after changes
10.621162576675415 18.58972748 765.7121070725207
```
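To read the numbers above, a small sketch that derives the relative change from the two rows. It assumes the third column is throughput in GB/s (as the `gb/device_ms*1000` header suggests); the dictionary keys and helper names are hypothetical.

```python
# Benchmark rows copied from the table above.
before = {"device_ms": 11.058772478103638,
          "cpu_ms": 18.912256770000006,
          "gb_per_s": 735.4118906278957}
after = {"device_ms": 10.621162576675415,
         "cpu_ms": 18.58972748,
         "gb_per_s": 765.7121070725207}


def speedup(old_ms, new_ms):
    """Ratio > 1 means the new run is faster."""
    return old_ms / new_ms


def gigabytes_moved(row):
    """Invert the header formula: gb = gb_per_s * device_ms / 1000."""
    return row["gb_per_s"] * row["device_ms"] / 1000


device_speedup = speedup(before["device_ms"], after["device_ms"])
cpu_speedup = speedup(before["cpu_ms"], after["cpu_ms"])

print(f"device speedup: {device_speedup:.3f}x")
print(f"cpu speedup:    {cpu_speedup:.3f}x")
print(f"data moved:     {gigabytes_moved(before):.2f} GB (before), "
      f"{gigabytes_moved(after):.2f} GB (after)")
```

Under that assumption, both runs move the same amount of data, and device time improves by roughly 4%, so the change does not regress performance.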
Differential Revision: D55030372
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122199
Approved by: https://github.com/ezyang