cumsum, cumprod, logcumsumexp: adjust grain size (#94025)
This addresses a common issue when parallelizing with `TensorIterator`: if the problem size is described as [M, N, K] but only [M, N] is reflected in the `TensorIterator` (with the K dimension folded into each unit of work), the `grain_size` passed to the parallel loop should also be divided by K, so that each thread still processes roughly `grain_size` elements in total.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94025
Approved by: https://github.com/XiaobingSuper