Limits CPU scalar error message to where it's appropriate (#42360)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/40986.
TensorIterator's check for a CUDA kernel receiving too many CPU scalar inputs was too permissive. This update restricts the check so that it ignores outputs and runs only when the kernel supports CPU scalars.
A test is added to verify that the appropriate error message is thrown in a case that previously produced the misleading CPU scalar message.
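The narrowed check can be illustrated with a minimal pure-Python sketch. This is a model of the logic described above, not the actual TensorIterator code: the `Operand` class, the `check_devices` function, the `MAX_CPU_SCALARS` limit of 1, and the error message wording are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Operand:
    """Hypothetical stand-in for a TensorIterator operand."""
    device: str            # "cpu" or "cuda"
    dim: int               # 0 means a scalar (zero-dimensional) tensor
    is_output: bool = False


# Assumed limit on CPU scalar inputs for a non-CPU kernel (illustrative).
MAX_CPU_SCALARS = 1


def check_devices(operands, common_device, allow_cpu_scalars):
    """Raise RuntimeError if the operands' devices are inconsistent.

    The CPU scalar branch is taken only for *inputs*, and only when the
    kernel declares that it supports CPU scalars. Everything else falls
    through to the generic same-device check.
    """
    cpu_scalars = 0
    for op in operands:
        is_cpu_scalar_input = (
            not op.is_output and op.dim == 0 and op.device == "cpu"
        )
        if common_device != "cpu" and allow_cpu_scalars and is_cpu_scalar_input:
            if cpu_scalars >= MAX_CPU_SCALARS:
                raise RuntimeError("too many CPU scalars for a non-CPU kernel")
            cpu_scalars += 1
        elif op.device != common_device:
            raise RuntimeError("expected all tensors to be on the same device")
```

In this sketch, a CPU scalar passed as an output, or passed to a kernel that does not allow CPU scalars, reports a device mismatch rather than the "too many CPU scalars" message.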
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42360
Reviewed By: ngimel
Differential Revision: D22868536
Pulled By: mruberry
fbshipit-source-id: 2bc8227978f8f6c0a197444ff0c607aeb51b0671