Handle tail 0-size tensor appropriately in `MultiTensorApply` (#100811)
Fixes #100701
It seems that `multi_tensor_apply_kernel` is never launched when the input tensor lists are small and their last tensors are zero-size; see e.g. https://github.com/pytorch/pytorch/blob/ca9f55f79d944672cb93157836f8ee92f54d2e10/aten/src/ATen/native/cuda/MultiTensorApply.cuh#L100-L102.
This behavior was introduced in https://github.com/pytorch/pytorch/commit/05943712a443138497c185405b575043b2916f34.
This PR special-cases trailing zero-size tensors so that the kernel launch is no longer skipped.
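
To illustrate the failure mode described above, here is a rough plain-Python model of a batching loop that skips zero-size tensors. It is only a sketch, not the code in `MultiTensorApply.cuh`: the function `simulate_launches`, its parameters, and the limits `max_tensors` and `max_blocks` are hypothetical, and the `special_case_tail` branch shows one possible way to handle the tail, which may differ from what this PR actually does. The model assumes a launch is triggered only when a batch fills up or when the last chunk of the last tensor is queued, which is why a small list ending in an empty tensor would never launch.

```python
def simulate_launches(numels, special_case_tail, chunk_size=4,
                      max_tensors=3, max_blocks=5):
    """Return the simulated launches as lists of (tensor_idx, chunk_idx).

    Hypothetical model: all names and limits here are illustrative only.
    """
    launches = []
    pending = []            # chunks queued for the next launch
    tensors_in_batch = 0
    n = len(numels)
    for t, numel in enumerate(numels):
        if numel == 0:
            # Skipping an empty tensor also skips the flush check below.
            # Assumed fix: flush queued work if this is the last tensor.
            if special_case_tail and t == n - 1 and pending:
                launches.append(pending)
                pending = []
            continue
        tensors_in_batch += 1
        chunks = (numel + chunk_size - 1) // chunk_size
        for c in range(chunks):
            pending.append((t, c))
            is_last_chunk_of_tensor = c == chunks - 1
            tensors_full = tensors_in_batch == max_tensors and is_last_chunk_of_tensor
            blocks_full = len(pending) == max_blocks
            last_chunk = t == n - 1 and is_last_chunk_of_tensor
            if tensors_full or blocks_full or last_chunk:
                launches.append(pending)
                pending = []
                # A partially processed tensor carries over to the next batch.
                tensors_in_batch = 0 if is_last_chunk_of_tensor else 1
    return launches


# Small list whose last tensor is empty: nothing is launched without the special case.
print(simulate_launches([3, 5, 0], special_case_tail=False))  # → []
print(simulate_launches([3, 5, 0], special_case_tail=True))   # → [[(0, 0), (1, 0), (1, 1)]]
```

In this model, an empty tensor at the head or in the middle of the list is harmless, since a later non-empty tensor still reaches the flush check; only a trailing empty tensor leaves queued work unlaunched.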
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100811
Approved by: https://github.com/ngimel