Reimplement torch::flip based on advanced indexing (#56713)
Summary:
## Rationale
This PR improves the performance of `torch::flip` by using `TensorIterator` in the same fashion as `AdvancedIndexing`. The implementation is therefore semantically equivalent to indexing a tensor with reversed indices along each flipped dimension, `A[dim0_size - 1:0 ..., dimN_size-1:0, ...]`.
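The equivalence between flipping and indexing with reversed indices can be illustrated with a minimal pure-Python sketch. This is not the `TensorIterator` code from this PR: the helper `flip_by_reversed_indices` is a hypothetical name, and it assumes a contiguous, row-major tensor stored as a flat list. Each output element at index `i` along a flipped dimension reads from position `size - 1 - i` of the input.

```python
from itertools import product


def flip_by_reversed_indices(data, shape, dims):
    """Flip a row-major flat list `data` of the given `shape` along `dims`.

    Hypothetical illustration only; assumes contiguous row-major storage.
    """
    # Row-major (contiguous) strides for the given shape.
    strides = [1] * len(shape)
    for d in range(len(shape) - 2, -1, -1):
        strides[d] = strides[d + 1] * shape[d + 1]

    flipped = set(dims)
    out = [None] * len(data)
    for idx in product(*(range(s) for s in shape)):
        # Along flipped dims, read from the mirrored position size - 1 - i.
        src = tuple(
            shape[d] - 1 - i if d in flipped else i for d, i in enumerate(idx)
        )
        dst_off = sum(i * s for i, s in zip(idx, strides))
        src_off = sum(i * s for i, s in zip(src, strides))
        out[dst_off] = data[src_off]
    return out


# A 2x3 "image" [H, W]: dim 1 is a horizontal flip, dim 0 a vertical flip.
x = [0, 1, 2,
     3, 4, 5]
horizontal = flip_by_reversed_indices(x, (2, 3), dims=[1])
vertical = flip_by_reversed_indices(x, (2, 3), dims=[0])
both = flip_by_reversed_indices(x, (2, 3), dims=[0, 1])
```

For the 4D `[B, C, H, W]` layout used in the benchmarks below, the same mapping applies with `dims=[2]` for a vertical flip and `dims=[3]` for a horizontal flip.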
## Benchmark results
The following benchmark compares the runtime of this implementation of `flip` against the current implementation, AdvancedIndexing with reversed indices, and the OpenCV implementation. The comparison scenarios use a 4D tensor `[B, C, H, W]`, where the flipped dimensions are `H` (vertical flip) and `W` (horizontal flip), with float32 and uint8 datatypes.
The benchmark implementation details can be found in https://github.com/andfoy/flip-benchmarks/blob/main/5_Stable_implementation/benchmarks.py. Additionally, there are correctness tests against the current flip implementation in https://github.com/andfoy/flip-benchmarks/blob/main/5_Stable_implementation/main.cpp, which cover different layouts, datatypes, and contiguous/non-contiguous tensors.
The following plots show the mean runtime of each operator over 100 samples. The new implementation of flip has a runtime similar to the indexing one, and the performance gains reach up to 6x in some scenarios.
### Horizontal flip (float)

### Horizontal flip (uint8)

### Vertical flip (float)

### Vertical flip (uint8)

cc fmassa vfdev-5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56713
Reviewed By: datumbox
Differential Revision: D28255088
Pulled By: fmassa
fbshipit-source-id: 5b8684812357c331e83a677b99cf0d78f0821678