Add last_n_window_collector
Summary:
Add `last_n_window_collector`, which Caffe2 (C2) supports but PyTorch currently does not: https://www.internalfb.com/intern/diffusion/FBS/browsefile/master/fbcode/caffe2/caffe2/operators/last_n_window_collector.cc?lines=139
## Problem that we are solving
This operator consumes multiple pieces of data and collects the last `n` elements it has seen.
Suppose the following pieces of data are passed in:
```
[1, 2, 3, 4]
[5, 6, 7]
[8, 9, 10, 11]
```
over three passes, and the collector size is given as 6. The expected result is:
```
[6, 7, 8, 9, 10, 11]
```
In other words, we need a FIFO (First In, First Out) mechanism: as data passes through the collector, new elements are pushed onto the end of a fixed-size queue and the oldest elements fall off the front.
In this particular example, in the first pass (the data is `[1, 2, 3, 4]`), we hold `[1, 2, 3, 4]` in the queue, since our queue size is 6.
In the second pass (the data is `[5, 6, 7]`), we hold `[2, 3, 4, 5, 6, 7]` in the queue; since `1` was inserted first, it is dropped due to the size limit of the queue.
In the third pass (the data is `[8, 9, 10, 11]`), we hold `[6, 7, 8, 9, 10, 11]` in the queue, and `2, 3, 4, 5` are dropped due to the size limit of the queue.
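Below is a minimal sketch of this FIFO behavior using `collections.deque` with `maxlen` (illustrative only, not the operator's actual code):
```
from collections import deque

# A FIFO queue capped at 6 elements: extending past capacity
# silently drops the oldest elements from the front.
window = deque(maxlen=6)

for chunk in ([1, 2, 3, 4], [5, 6, 7], [8, 9, 10, 11]):
    window.extend(chunk)
    print(list(window))

# Pass 1: [1, 2, 3, 4]
# Pass 2: [2, 3, 4, 5, 6, 7]
# Pass 3: [6, 7, 8, 9, 10, 11]
```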
For the multi-dimensional case, suppose we have the following data:
```
[[1, 2], [2, 3], [3, 4], [4, 5]]
[[5, 6], [6, 7], [7, 8]]
[[8, 9], [9, 10], [10, 11], [11, 12]]
```
and our queue size is 6.
In the first pass, we will have `[[1, 2], [2, 3], [3, 4], [4, 5]]`.
In the second pass, we will have `[[2, 3], [3, 4], [4, 5], [5, 6], [6, 7], [7, 8]]`.
In the third pass, we will have `[[6, 7], [7, 8], [8, 9], [9, 10], [10, 11], [11, 12]]`.
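The same sketch extends to the multi-dimensional case: each row is collected as a unit, so iterating a 2-D tensor (which yields its rows along dim 0) and extending the queue reproduces the passes above (again an illustration, not the shipped code):
```
from collections import deque

import torch

window = deque(maxlen=6)
chunks = [
    torch.tensor([[1, 2], [2, 3], [3, 4], [4, 5]]),
    torch.tensor([[5, 6], [6, 7], [7, 8]]),
    torch.tensor([[8, 9], [9, 10], [10, 11], [11, 12]]),
]
for chunk in chunks:
    window.extend(chunk)  # iterating a 2-D tensor yields its rows
    print(torch.stack(list(window)))
```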
### The implementation
I am using Python's FIFO queue from the `collections` library (`collections.deque`), which accepts a `maxlen` argument that sets the size of the queue.
I select the last `n` rows of the tensor through indexing, so this operator does not copy the data.
The test plan covers both single-dimension and multi-dimensional tensors.
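A hypothetical end-to-end sketch of this approach (the class and method names here are mine, not the actual `sparsenn` API): only the last `n` rows of each input can survive a pass, so we index them off first (a view, not a copy) and then push them into the deque.
```
from collections import deque

import torch


class LastNWindowCollector:
    """Illustrative collector that keeps the last `n` rows seen."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.window = deque(maxlen=n)

    def collect(self, data: torch.Tensor) -> torch.Tensor:
        # Only the last `n` rows can remain in the window, so index
        # them first; this indexing returns a view, not a copy.
        for row in data[-self.n:]:
            self.window.append(row)
        return torch.stack(list(self.window))
```
With the 1-D example above, calling `collect` three times on a `LastNWindowCollector(6)` returns `tensor([6, 7, 8, 9, 10, 11])` after the third call.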
### Benchmark
I benchmarked several different configurations and added a benchmark test. The PyTorch implementation is much faster than the Caffe2 implementation:
#### CPU Benchmark
```
torch_response.median
0.00019254473969340324
caffe_response.median
0.00030233583599794657
```
#### GPU Benchmark
```
torch_response.mean
0.000081007429903838786
caffe_response.mean
0.00010279081099724863
```
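The `median`/`mean` values above appear to be seconds per iteration. A hypothetical sketch of producing such numbers with `torch.utils.benchmark` (the statement, sizes, and iteration count here are assumptions, not the actual benchmark configuration):
```
from collections import deque

import torch
import torch.utils.benchmark as benchmark

data = torch.arange(1000)
window = deque(maxlen=6)

timer = benchmark.Timer(
    stmt="window.extend(data[-6:].tolist())",  # assumed workload
    globals={"window": window, "data": data},
)
torch_response = timer.timeit(1000)
print(torch_response.median)  # seconds per iteration
```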
Test Plan:
### For CPU:
```
buck test //caffe2/torch/fb/sparsenn:test
```
### For GPU:
- Used an on-demand machine and ran the following commands:
```
jf get D24435544
buck test mode/opt //caffe2/torch/fb/sparsenn:test
```
https://www.internalfb.com/intern/testinfra/testconsole/testrun/4222124688138052/
Reviewed By: dzhulgakov, radkris-git
Differential Revision: D24435544
fbshipit-source-id: 8193b4746b20f2a4920fd4d41271341045cdcee1