Allow use cpu_serial_kernel with void-lambda (#27271)
Summary:
Currently we use CPU_tensor_apply1 to loop through the tensor in a single thread and aggregate data:
```
// compute variance per input
accscalar_t var_sum = 0;
CPU_tensor_apply1<scalar_t>(in, [&] (const scalar_t& i) {
var_sum += (i - mean) * (i - mean);
});
```
and there is no way to use TensorIterator for this, because its lambda must return a value to write back:
```
accscalar_t var_sum = 0;
auto iter = TensorIterator::unary_op(self, self);
cpu_serial_kernel(iter, [&](scalar_t i) -> scalar_t {
var_sum += (i - mean) * (i - mean);
  return i; // unable to write the value back, because self should be const
});
```
This PR resolves this problem by allowing a void lambda:
```
auto iter = at::TensorIterator();
iter.add_input(in);
iter.build();
accscalar_t var_sum = 0;
at::native::cpu_serial_kernel(iter, [&](scalar_t i) -> void {
var_sum += (i - mean) * (i - mean);
});
```
In the future it makes sense to change the reduction code to allow reducing to a scalar, not just to a tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27271
Differential Revision: D17743310
Pulled By: ifedan
fbshipit-source-id: a149751f2d671aefd3ed84bd50b2c0543a63b701