Refactor TensorAt, prepare for release (#5180)
* Refactor TensorAt
locations* must be const and int64_t since our dims are int64_t
Remove unnecessary copy of locations.
Remove unnecessary casts, including C-style casts. Simplify implementation.
Add a check for string type.
Make the C++ API return T& to fully expose the C API in C++, and take
const std::vector& instead of a by-value vector, as it covers more ground
and eliminates a redundant copy.
Eliminate inner loop, compute strides first.
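
For illustration only, a minimal sketch of the strides-first approach, assuming a dense row-major layout. FlatOffset and its parameters are hypothetical names chosen for this note, not the code in this change: strides are computed in one pass, then the element offset is a single loop over locations with no nested loop.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not the actual implementation): returns the flat
// element offset of `locations` in a dense row-major tensor of shape `dims`.
inline int64_t FlatOffset(const std::vector<int64_t>& dims,
                          const int64_t* locations,
                          size_t num_locations) {
  if (num_locations != dims.size()) {
    throw std::invalid_argument("location count does not match tensor rank");
  }

  // Pass 1: compute strides from the innermost dimension outwards.
  std::vector<int64_t> strides(dims.size());
  int64_t stride = 1;
  for (size_t i = dims.size(); i-- > 0;) {
    strides[i] = stride;
    stride *= dims[i];
  }

  // Pass 2: one multiply-add per dimension, with a bounds check.
  int64_t offset = 0;
  for (size_t i = 0; i < num_locations; ++i) {
    if (locations[i] < 0 || locations[i] >= dims[i]) {
      throw std::out_of_range("location is outside the tensor bounds");
    }
    offset += locations[i] * strides[i];
  }
  return offset;
}
```

With dims {2, 3, 4} the strides are {12, 4, 1}, so location {1, 2, 3} maps to offset 1*12 + 2*4 + 3 = 23.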