Optimize slicing when possible by copying bigger blocks at once (#13261)
### Description
Currently, SliceIterator copies at most one innermost dimension's worth
of elements per copy. However, in many slices several inner dimensions
can be copied at once.
Furthermore, even if a dimension is sliced, it may use step 1, in which
case its selected range and all dimensions inside it still form a
contiguous block that can be copied at once.
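
The rule above can be sketched as a small planning routine. This is only an illustration, not the code in this PR: `AxisSlice`, `CopyPlan`, and `PlanSliceCopies` are hypothetical names, and the sketch assumes positive steps and already-normalized `start`/`end` values.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical description of a slice along one axis: [start, end) with a step.
// Assumes start/end are already clamped to the axis and step > 0.
struct AxisSlice {
  int64_t start;
  int64_t end;
  int64_t step;
};

// How many elements one copy can cover, and how many copies are needed.
struct CopyPlan {
  int64_t block_elems;
  int64_t num_copies;
};

// Walk from the innermost axis outward. Axes taken in full are absorbed into
// the contiguous block. The first partially covered axis can still be
// absorbed if its step is 1, but no axis outside it can be.
CopyPlan PlanSliceCopies(const std::vector<int64_t>& dims,
                         const std::vector<AxisSlice>& slices) {
  int64_t block = 1;
  int axis = static_cast<int>(dims.size()) - 1;
  for (; axis >= 0; --axis) {
    const AxisSlice& s = slices[axis];
    if (s.step != 1) break;  // strided axis: elements are not adjacent
    const int64_t extent = s.end - s.start;
    block *= extent;
    if (extent != dims[axis]) {
      --axis;  // partial axis is absorbed, then the block ends here
      break;
    }
  }

  // Every remaining outer axis multiplies the number of separate copies.
  int64_t copies = 1;
  for (; axis >= 0; --axis) {
    const AxisSlice& s = slices[axis];
    copies *= (s.end - s.start + s.step - 1) / s.step;  // selected indices
  }
  return {block, copies};
}
```

For an input of shape `[2, 3, 8, 5]` sliced with `[:, :, 2:, :]`, this plan gives 6 copies of 30 elements each, whereas copying one innermost row at a time takes 36 copies of 5 elements each.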
### Motivation and Context
For example, take an input of shape `[N, C, H, W]` sliced with
`[:, :, i:, :]`, which produces an output of shape `[N, C, H-i, W]`.
That is, we slice along a single axis with step = 1. For each batch
element, the current implementation does `C * (H-i)` `memcpy` calls of
`W` elements each. With this change we can do `C` `memcpy` calls of
`(H-i)*W` elements each.
The optimization produces ~11% savings on certain internal models.