construct only necessary elements in OffsetCalculator (#55107)
Summary:
Per title. Elements beyond `dim` are never accessed, as shown by the code at https://github.com/pytorch/pytorch/blob/646510f7028f12e8b1f3a9d3b63b8519ed80e391/aten/src/ATen/cuda/detail/OffsetCalculator.cuh#L49-L51, so there is no need to construct them.
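As a rough illustration of why skipping the unused tail is safe, here is a small Python model of an offset loop that stops at the number of dimensions. It is only a sketch, not the CUDA code: `get_offset`, `MAX_DIMS = 25`, and the sample sizes and strides are all assumptions made up for this example.

```python
# Illustrative Python model only; the real code is the CUDA/C++ header linked above.
MAX_DIMS = 25  # assumed upper bound on dimensions, for illustration


def get_offset(linear_idx, dims, sizes, strides):
    """Convert a linear index to a strided offset, stopping at `dims`."""
    offset = 0
    for dim in range(MAX_DIMS):
        if dim == dims:
            break  # entries at index >= dims are never read
        linear_idx, mod = divmod(linear_idx, sizes[dim])
        offset += mod * strides[dim]
    return offset


dims = 2
sizes, strides = [2, 2], [1, 2]  # hypothetical 2x2 layout

# Fully initialized arrays: pad the unused tail with size 1 and stride 0.
full_sizes = sizes + [1] * (MAX_DIMS - dims)
full_strides = strides + [0] * (MAX_DIMS - dims)

# Minimal arrays: the tail holds placeholders that would raise if ever used.
lazy_sizes = sizes + [None] * (MAX_DIMS - dims)
lazy_strides = strides + [None] * (MAX_DIMS - dims)

full = [get_offset(i, dims, full_sizes, full_strides) for i in range(4)]
lazy = [get_offset(i, dims, lazy_sizes, lazy_strides) for i in range(4)]
print(full == lazy)
```

Both variants produce the same offsets in this model because the loop exits before touching the tail, so initializing it is wasted work.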
Instruction counts per 30 repetitions:
`addmm` 1467813 -> 1452261
`add` 651522 -> 633462
`add_` 529331 -> 511271
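To put the figures above in per-repetition terms, the arithmetic can be sketched as follows. This assumes the totals divide evenly across the 30 repetitions, which the measurements do not by themselves guarantee; the variable names are illustrative.

```python
# Instruction counts per 30 repetitions, copied from the figures above.
counts = {
    "addmm": (1467813, 1452261),
    "add": (651522, 633462),
    "add_": (529331, 511271),
}
REPS = 30

savings = {}
for op, (before, after) in counts.items():
    saved = before - after
    per_rep = saved / REPS          # average instructions saved per repetition
    percent = 100 * saved / before  # relative reduction in total instructions
    savings[op] = (per_rep, percent)
    print(f"{op}: {per_rep:.1f} fewer instructions per repetition ({percent:.2f}%)")
```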
Benchmarking snippet for `add_`:
```
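import timeit
from torch.utils.benchmark import Timer
# Collect callgrind instruction counts for 30 runs of an in-place CUDA add (C++ snippet).
timer = Timer("m1.add_(b);", setup="at::Tensor m1 = torch::empty({2, 2}, device(at::kCUDA)); at::Tensor b = torch::empty({2}, device(at::kCUDA));", language="c++", timer=timeit.default_timer)
stats = timer.collect_callgrind(number=30)
print(stats.as_standardized().stats(inclusive=False))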
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55107
Reviewed By: swolchok
Differential Revision: D27494492
Pulled By: ngimel
fbshipit-source-id: 23389a6bc9c9c0096751b95e7f9bf1c9f7bc594f