[MPS] Add native `cumsum` implementation (#88319)
Using https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/4057333-cumulativesumwithtensor?language=objc
Fall back to CPU when running on older macOS versions
In `unary_op`, add the output tensor's dims/dtype to the graph key (since even the default op checks the output graph type)
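The effect of the key change can be sketched roughly as below. This is a plain-Python illustration, not the actual `unary_op` code: `tensor_key` and `graph_key` are hypothetical helpers and the key format is an assumption made only for this example.

```python
# Hypothetical sketch: these helpers and the key format are illustrative
# assumptions, not PyTorch's actual MPS graph-cache code.
def tensor_key(shape, dtype):
    """Encode a tensor's dims and dtype as a string fragment."""
    return f"{dtype}[{','.join(str(d) for d in shape)}]"


def graph_key(op_name, in_shape, in_dtype, out_shape, out_dtype):
    """Build a cache key covering both the input and the output tensor."""
    return ":".join(
        [op_name, tensor_key(in_shape, in_dtype), tensor_key(out_shape, out_dtype)]
    )


# Same input, different output dtype: the keys must differ so that a graph
# built for one output type is not reused for the other.
same_input = ((4, 3), "int16")
key_a = graph_key("cumsum", *same_input, (4, 3), "int16")
key_b = graph_key("cumsum", *same_input, (4, 3), "int32")
```

If the key covered only the input, `key_a` and `key_b` would be identical and a cached graph could be reused with a mismatched output type.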
Also, upcast int16 to int32, since the MPS cumsum op on Ventura returns incorrect results for the Int16 type (upcasting also makes sense for int8, where the chance of overflow is very high)
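To illustrate why a narrow accumulator overflows so quickly, here is a minimal pure-Python sketch that simulates two's-complement wraparound. It does not call MPS or PyTorch; `wrap_signed` and `cumsum` are hypothetical helpers written only for this example.

```python
# Illustration only: simulates fixed-width signed accumulation in plain Python.
def wrap_signed(value, bits):
    """Wrap an integer into the two's-complement range of the given width."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >= (1 << (bits - 1)) else value


def cumsum(values, bits):
    """Running sum whose accumulator is limited to `bits` signed bits."""
    out, total = [], 0
    for v in values:
        total = wrap_signed(total + v, bits)
        out.append(total)
    return out


data = [100, 100, 100]
narrow = cumsum(data, 8)   # int8-like accumulator wraps after two elements
wide = cumsum(data, 32)    # int32-like accumulator holds the true sums
```

With only three elements the 8-bit running sum already wraps, while the 32-bit one does not.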
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88319
Approved by: https://github.com/kulinseth