[aten] index_select dim 1 (#47077)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47077
Add benchmarks for PyTorch index_select, batch_index_select, and Caffe2's BatchGather.
Add a batch_index_select implementation based on the Caffe2 BatchGather implementation.
It currently falls back to index_select for the backward pass and for CUDA.
Alternatively, we could investigate why index_select is slower and
replace the original implementation instead.
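For reference, a minimal pure-Python sketch of the BatchGather-style loop this change is modeled on. It is not the ATen or Caffe2 code. It assumes a contiguous input laid out as (outer, dim_size, block_size) and gathers along dim 1; the function name and parameters are hypothetical.

```python
def batch_gather_dim1(data, indices, outer, dim_size, block_size):
    """Gather along dim 1 of a flat, contiguous (outer, dim_size, block_size) buffer.

    Hypothetical helper for illustration only; not a PyTorch or Caffe2 API.
    """
    out = []
    for batch in range(outer):
        # Offset of this batch's first element in the flat buffer.
        base = batch * dim_size * block_size
        for idx in indices:
            if not 0 <= idx < dim_size:
                raise IndexError(f"index {idx} out of range for dim of size {dim_size}")
            # Copy one contiguous block of block_size elements per index.
            start = base + idx * block_size
            out.extend(data[start:start + block_size])
    return out


# A 2 x 3 x 2 tensor stored flat: batch 0 holds 0..5, batch 1 holds 6..11.
data = list(range(12))
result = batch_gather_dim1(data, [2, 0], outer=2, dim_size=3, block_size=2)
print(result)
```

With block_size 1, each copy in the inner loop moves a single element. That appears to be the case the "block size 1 only" variant in the test plan targets.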
Test Plan:
./buck-out/opt/gen/caffe2/benchmarks/operator_benchmark/c2/batch_gather_test.par
./buck-out/opt/gen/caffe2/benchmarks/operator_benchmark/pt/index_select_test.par
PT results for three variants: without the fix, with the fix for block_size 1 only, and with the fix for all dim=1 cases:
```
# no optimization
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short
# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M256_N512_K1_dim1_cpu
# Input: M: 256, N: 512, K: 1, dim: 1, device: cpu
Forward Execution Time (us) : 353.450
# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M512_N512_K1_dim1_cpu
# Input: M: 512, N: 512, K: 1, dim: 1, device: cpu
Forward Execution Time (us) : 862.492
# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M256_N512_K2_dim1_cpu
# Input: M: 256, N: 512, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 4555.344
# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M512_N512_K2_dim1_cpu
# Input: M: 512, N: 512, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 11003.279
```
```
# block size 1 only
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short
# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M256_N512_K1_dim1_cpu
# Input: M: 256, N: 512, K: 1, dim: 1, device: cpu
Forward Execution Time (us) : 129.240
# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M512_N512_K1_dim1_cpu
# Input: M: 512, N: 512, K: 1, dim: 1, device: cpu
Forward Execution Time (us) : 266.776
# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M256_N512_K2_dim1_cpu
# Input: M: 256, N: 512, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 4508.593
# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M512_N512_K2_dim1_cpu
# Input: M: 512, N: 512, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 10391.655
```
```
# dim 1
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short
# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M8_N8_K1_dim1_cpu
# Input: M: 8, N: 8, K: 1, dim: 1, device: cpu
Forward Execution Time (us) : 3.736
# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M256_N512_K1_dim1_cpu
# Input: M: 256, N: 512, K: 1, dim: 1, device: cpu
Forward Execution Time (us) : 130.460
# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M512_N512_K1_dim1_cpu
# Input: M: 512, N: 512, K: 1, dim: 1, device: cpu
Forward Execution Time (us) : 267.706
# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M8_N8_K2_dim1_cpu
# Input: M: 8, N: 8, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 4.187
# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M256_N512_K2_dim1_cpu
# Input: M: 256, N: 512, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 1739.550
# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M512_N512_K2_dim1_cpu
# Input: M: 512, N: 512, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 3468.332
```
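To compare the "no optimization" and "dim 1" runs above, the ratios can be computed with a short script such as the following. The timings are copied from the output above; the dictionary keys are only labels derived from the benchmark names.

```python
# Forward execution times in microseconds, copied from the logs above.
baseline_us = {
    "M256_N512_K1": 353.450,
    "M512_N512_K1": 862.492,
    "M256_N512_K2": 4555.344,
    "M512_N512_K2": 11003.279,
}
dim1_us = {
    "M256_N512_K1": 130.460,
    "M512_N512_K1": 267.706,
    "M256_N512_K2": 1739.550,
    "M512_N512_K2": 3468.332,
}

# Speedup of the dim=1 path relative to the unoptimized run.
speedup = {name: baseline_us[name] / dim1_us[name] for name in baseline_us}
for name, ratio in speedup.items():
    print(f"{name}: {ratio:.2f}x")
```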
C2 results:
```
# Benchmarking Caffe2: batch_gather
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1203 13:19:35.310904 782584 init.h:137] Caffe2 GlobalInit should be run before any other API calls.
# Name: batch_gather_M8_N8_K1_devicecpu
# Input: M: 8, N: 8, K: 1, device: cpu
Forward Execution Time (us) : 0.308
# Benchmarking Caffe2: batch_gather
# Name: batch_gather_M256_N512_K1_devicecpu
# Input: M: 256, N: 512, K: 1, device: cpu
Forward Execution Time (us) : 90.517
# Benchmarking Caffe2: batch_gather
# Name: batch_gather_M512_N512_K1_devicecpu
# Input: M: 512, N: 512, K: 1, device: cpu
Forward Execution Time (us) : 200.009
# Benchmarking Caffe2: batch_gather
# Name: batch_gather_M8_N8_K2_devicecpu
# Input: M: 8, N: 8, K: 2, device: cpu
Forward Execution Time (us) : 0.539
# Benchmarking Caffe2: batch_gather
# Name: batch_gather_M256_N512_K2_devicecpu
# Input: M: 256, N: 512, K: 2, device: cpu
Forward Execution Time (us) : 1001.540
# Benchmarking Caffe2: batch_gather
# Name: batch_gather_M512_N512_K2_devicecpu
# Input: M: 512, N: 512, K: 2, device: cpu
Forward Execution Time (us) : 2005.870
```
buck test dper3/dper3/modules/low_level_modules/tests:single_operators_test -- test_batch_gather
Reviewed By: hlu1
Differential Revision: D24630227
fbshipit-source-id: cd205a30d96a33d239f3266820ada9a90093cf91