Optimize size(dim) and stride(dim)
This changes `c10::maybe_wrap_dim` to short-circuit the "happy path",
where `dim` is already in the valid range, and moves the error and scalar
edge cases out of line. These changes cut the callgrind instruction count
for `size(i)` from 5200 to 2000.
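
The fast-path/slow-path split described above can be sketched roughly as
follows. This is a simplified illustration, not the actual c10 code: the
`wrap_sketch` namespace, the `maybe_wrap_dim_slow` helper, the error text,
and the rule that a scalar wraps as if it were 1-d are all assumptions
made for the example.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

namespace wrap_sketch {

// Slow path (hypothetical helper name). It is kept out of line so the
// inlined fast path below stays small. It handles the scalar (0-dim)
// case and raises the out-of-range error.
#if defined(__GNUC__) || defined(__clang__)
__attribute__((noinline))
#endif
int64_t maybe_wrap_dim_slow(int64_t dim, int64_t ndim, bool wrap_scalar);

// Fast path: a single range check covers every valid dim, and a negative
// dim is wrapped by adding ndim.
inline int64_t maybe_wrap_dim(int64_t dim, int64_t ndim, bool wrap_scalar = true) {
  if (-ndim <= dim && dim < ndim) {
    return dim < 0 ? dim + ndim : dim;
  }
  return maybe_wrap_dim_slow(dim, ndim, wrap_scalar);
}

int64_t maybe_wrap_dim_slow(int64_t dim, int64_t ndim, bool wrap_scalar) {
  if (ndim == 0 && wrap_scalar) {
    // Assumption for this sketch: a scalar is treated as 1-d, so both
    // -1 and 0 map to 0 and anything else is an error.
    return maybe_wrap_dim(dim, /*ndim=*/1, /*wrap_scalar=*/false);
  }
  throw std::out_of_range(
      "Dimension out of range: got " + std::to_string(dim) +
      " for a tensor with " + std::to_string(ndim) + " dimension(s)");
}

} // namespace wrap_sketch
```

With this shape, a call such as `wrap_sketch::maybe_wrap_dim(-1, 3)` only
runs the range check and one addition. The string formatting and the scalar
handling are reached only when the check fails.
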
In the `size` and `stride` methods themselves, I also avoid calling
`TensorImpl::dim()` since it may be a virtual call. This further
reduced the instruction count from 2000 to 1500.
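
As a rough illustration of avoiding the `dim()` call, the sketch below
fetches the sizes once and derives the number of dimensions from their
length. `ImplSketch` is a hypothetical stand-in for the real tensor class,
and the inline range check stands in for `c10::maybe_wrap_dim`. Both are
assumptions made for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

namespace size_sketch {

// Hypothetical stand-in for a tensor implementation class.
struct ImplSketch {
  explicit ImplSketch(std::vector<int64_t> sizes) : sizes_(std::move(sizes)) {}
  virtual ~ImplSketch() = default;

  // May be overridden by subclasses, so calling it can cost a virtual dispatch.
  virtual int64_t dim() const { return static_cast<int64_t>(sizes_.size()); }

  const std::vector<int64_t>& sizes() const { return sizes_; }

  // Reads the sizes once and uses their length as the dimension count,
  // so dim() is never called on this path.
  int64_t size(int64_t d) const {
    const auto& s = sizes();
    const auto ndim = static_cast<int64_t>(s.size());
    if (d < -ndim || d >= ndim) {
      throw std::out_of_range("dimension out of range");
    }
    if (d < 0) {
      d += ndim; // wrap negative dims, e.g. -1 refers to the last dimension
    }
    return s[static_cast<std::size_t>(d)];
  }

 private:
  std::vector<int64_t> sizes_;
};

} // namespace size_sketch
```

The same pattern would apply to `stride(dim)`, using the strides array in
place of the sizes.
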
For comparison, `tensor.sizes()[0]` takes 1200 instructions, so
`tensor.size(0)` is still marginally slower. This is unavoidable,
though, since `size(dim)` has to handle dimension wrapping (for example,
negative `dim` values).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75416
Approved by: https://github.com/Lezcano, https://github.com/ngimel