Optimize tensor.slice() (#1381)
* Optimize tensor.slice()
`tensor.slice()` is very slow, especially for large tensors such as
`logits`:
```
const logits = outputs.logits.slice(null, -1, null);
```
This is because the current implementation of the `slice` method iterates
over every element and computes its indices individually, which is very
time-consuming when the tensor is large.
For cases like `slice(null, -1, null)`, where the sliced region is
contiguous along certain dimensions, the copy can be done in bulk
using `TypedArray.subarray()` and `TypedArray.set()`.
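A minimal sketch of the bulk-copy idea, not the code in this change: it
assumes a row-major typed array, at least one dimension, and explicit
`[start, end)` ranges per dimension. The name `sliceBulk` and its
signature are hypothetical and only serve the illustration.

```javascript
// Hypothetical helper illustrating block-wise slicing of a row-major tensor.
// data:   a TypedArray holding the tensor values
// dims:   the tensor shape, e.g. [2, 3, 4]
// ranges: one [start, end) pair per dimension (assumed already normalized)
function sliceBulk(data, dims, ranges) {
  const newDims = ranges.map(([start, end]) => end - start);
  const outSize = newDims.reduce((a, b) => a * b, 1);
  const out = new data.constructor(outSize);
  if (outSize === 0) return { data: out, dims: newDims };

  // Row-major strides: the last dimension is contiguous in memory.
  const strides = new Array(dims.length).fill(1);
  for (let i = dims.length - 2; i >= 0; --i) {
    strides[i] = strides[i + 1] * dims[i + 1];
  }

  // Count trailing dimensions that are taken in full. Together with the
  // dimension just before them, they form one contiguous block per copy.
  let k = dims.length;
  while (k > 0 && ranges[k - 1][0] === 0 && ranges[k - 1][1] === dims[k - 1]) --k;
  const pivot = Math.max(k - 1, 0);
  const blockLen = newDims.slice(pivot).reduce((a, b) => a * b, 1);
  const blockCount = outSize / blockLen;

  // Walk the outer dimensions like an odometer, copying one block each step.
  const idx = new Array(pivot).fill(0);
  for (let b = 0; b < blockCount; ++b) {
    let src = ranges[pivot][0] * strides[pivot];
    for (let i = 0; i < pivot; ++i) {
      src += (ranges[i][0] + idx[i]) * strides[i];
    }
    // subarray() creates a view without copying; set() copies it in bulk.
    out.set(data.subarray(src, src + blockLen), b * blockLen);
    for (let i = pivot - 1; i >= 0; --i) {
      if (++idx[i] < newDims[i]) break;
      idx[i] = 0;
    }
  }
  return { data: out, dims: newDims };
}
```

For a `[2, 3, 4]` tensor, taking only the last position along the second
dimension needs two copies of four elements each instead of eight
per-element index computations. This sketch keeps the sliced dimension
with size 1.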
* nit
* Add a few more tensor slice unit tests
---------
Co-authored-by: Joshua Lochner <26504141+xenova@users.noreply.github.com>