A few performance improvements coming out of ssd_mobilenet and ssd_resnet34 analysis (#1578)
* A few performance improvements:
- Make the iteration in NonZero more efficient by using a raw pointer and simplifying the increment logic
- add another unit test to check the new logic works with 3 dimensional tensor
- gains about 2% for ssd_mobilenet
- Avoid floating point operations on each iteration on Concat
- about 0.5% for ssd_mobilenet and ssd_resnet34
- Put common case first in ExecutionFrame::AllocateAsPerAllocationPlan to avoid unnecessary call to IsSparseTensor
- about 0.05% for ssd_mobilenet
- Minor tweak to put some ctors in the TensorShape header so they can be inlined more easily