[inductor] Lower masked_scatter on CUDA (#108803)
This decomposes masked_scatter into `aten.cumsum` and a single pointwise kernel,
which is similar to what eager mode does. This is only done for CUDA because the CPU
eager implementation isn't split into two passes like this, so the decomposition would cause a slowdown there.
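The decomposition can be sketched in plain PyTorch. This is a hypothetical illustration of the two-pass idea (a cumsum over the mask followed by a single pointwise gather/select), not the actual inductor lowering; the helper name `masked_scatter_decomposed` is made up for this example:

```python
import torch

def masked_scatter_decomposed(self, mask, source):
    # Sketch of decomposing masked_scatter: one cumsum pass plus one
    # pointwise pass. Assumes `source` has enough elements for the mask.
    mask_flat = mask.expand_as(self).reshape(-1)
    # Running count of True entries; subtracting 1 turns the count into
    # a 0-based index into the flattened `source`.
    source_idx = torch.cumsum(mask_flat.to(torch.int64), dim=0) - 1
    src_flat = source.reshape(-1)
    # Pointwise step: gather from source at masked positions, keep the
    # original value elsewhere. clamp avoids the -1 index at unmasked
    # positions before the first True; those values are discarded by where.
    gathered = src_flat[source_idx.clamp(min=0)]
    out = torch.where(mask_flat, gathered, self.reshape(-1))
    return out.reshape(self.shape)
```

Checking it against the eager op:

```python
x = torch.arange(6.)
mask = torch.tensor([True, False, True, False, True, False])
src = torch.tensor([10., 20., 30.])
assert torch.equal(masked_scatter_decomposed(x, mask, src),
                   x.masked_scatter(mask, src))
```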
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108803
Approved by: https://github.com/lezcano
ghstack dependencies: #108802