fix segfault for EmbeddingBag on CPU slow path when include_last_offset is true (#90358)
This PR fixes the segfault reported in https://github.com/pytorch/pytorch/issues/89677, a `double free` caused by an `invalid read`.
The reported issue hits the slow path of `EmbeddingBag` for float32, at [EmbeddingBag.cpp#L451](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/EmbeddingBag.cpp#L451).
The root cause is that, for the reported case, `add_indices` contains an index that exceeds the valid range of `output_data`.
The `offsets` are given as:
```
{0, 6, 12, 15, 25, 32, 40, 42, 46, 53, 53}
```
`indices` has 55 elements, and `offsets[-1] != indices.size(0)`.
When `include_last_offset` is true, the `output` will have the shape `{offsets.size(0) - 1, weight.sizes()[1]}`, which is `{10, 5}` here.
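
For reference, a minimal Python sketch of an input matching the shapes above. The `indices`, `weight`, and `per_sample_weights` values are placeholders rather than the ones from the original report, and the exact arguments needed to reach the CPU slow path may differ:
```python
import torch
import torch.nn.functional as F

# Offsets from the report: the last entry (53) does not equal indices.size(0) (55).
offsets = torch.tensor([0, 6, 12, 15, 25, 32, 40, 42, 46, 53, 53])
indices = torch.randint(0, 10, (55,))              # placeholder indices
weight = torch.randn(10, 5, dtype=torch.float32)   # placeholder embedding table

# per_sample_weights is one way the float32 slow path may be exercised
# (an assumption; the original report may use different arguments).
out = F.embedding_bag(
    indices, weight, offsets,
    mode="sum",
    per_sample_weights=torch.ones(55, dtype=torch.float32),
    include_last_offset=True,   # output shape becomes {offsets.size(0) - 1, 5} == {10, 5}
)
```
Before this fix, an input of this shape could trigger the out-of-range write described below.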
With the current (unfixed) code, `add_indices` comes out as (I re-arranged the 1D tensor into rows for readability, 10 rows in total):
```
### this is 55 elements
0 0 0 0 0 0
1 1 1 1 1 1
2 2 2
3 3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4
5 5 5 5 5 5 5 5
6 6
7 7 7 7
8 8 8 8 8 8 8
10 10
```
The last row contains the index 10, which is out of range for the output tensor of size `[10, 5]`.
The reason is that `make_offset2bag` at [EmbeddingBag.cpp#L66](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/EmbeddingBag.cpp#L66) produces the following `offset2bag`:
```
### this is 55 + 1 elements:
0 0 0 0 0 0 1
0 0 0 0 0 1
0 0 1
0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 1
0 0 0 0 0 0 0 1
0 1
0 0 0 1
0 0 0 0 0 0 2
0 0
```
Notice that the entry at index 53 is incremented twice, because 53 appears twice in `offsets`; after the cumulative sum this pushes the trailing bag indices to 10.
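
For illustration, here is a rough Python sketch of what `make_offset2bag` effectively computes for this input (scatter-add a one at every offset, then take a cumulative sum); the real C++ implementation differs in details:
```python
import torch

offsets = torch.tensor([0, 6, 12, 15, 25, 32, 40, 42, 46, 53, 53])
num_indices = 55

# Mark every bag boundary; because 53 appears twice in `offsets`,
# position 53 receives a 2 -- this is the table shown above.
markers = torch.zeros(num_indices + 1, dtype=torch.long)
markers.index_add_(0, offsets, torch.ones_like(offsets))
markers[0] -= 1  # the first index always belongs to bag 0

# The cumulative sum turns the markers into per-element bag ids.
offset2bag = markers.cumsum(0)
print(offset2bag[51:55])  # tensor([ 8,  8, 10, 10]) -- elements 53 and 54 land in "bag 10"
```
Bag id 10 is exactly the out-of-range index seen in `add_indices` above.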
The fix is to ignore the last entry of `offsets` when `include_last_offset` is true. This also aligns the behavior with CUDA; see https://github.com/pytorch/pytorch/pull/57208#issuecomment-1021727378
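
In Python terms, the effect of the fix looks roughly like the sketch below (a simplified illustration with a hypothetical helper, not the actual C++ change): dropping the final entry of `offsets` before building `offset2bag` keeps every bag id within the `{10, 5}` output.
```python
import torch

def make_offset2bag_sketch(offsets, num_indices, include_last_offset):
    # Simplified Python rendering of the fixed behavior; the real change lives
    # in aten/src/ATen/native/EmbeddingBag.cpp.
    if include_last_offset:
        offsets = offsets[:-1]  # ignore the last offset, matching CUDA behavior
    markers = torch.zeros(num_indices + 1, dtype=torch.long)
    markers.index_add_(0, offsets, torch.ones_like(offsets))
    markers[0] -= 1
    return markers.cumsum(0)[:num_indices]

offsets = torch.tensor([0, 6, 12, 15, 25, 32, 40, 42, 46, 53, 53])
bag_ids = make_offset2bag_sketch(offsets, 55, include_last_offset=True)
print(bag_ids.max())  # tensor(9) -- all bag ids now fit the {10, 5} output
```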
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90358
Approved by: https://github.com/ezyang