[Inductor] cpp further code cleanup (#98135)
This PR makes two main changes:
1. Support all ops (not only the load-related ops) in `ops.masked`. `CppVecKernelChecker` now checks the masked body recursively. With this, we can remove the `is_load_only_block` function and the corresponding checking logic in `masked`.
2. Set the step of each vectorized loop to the vectorization scaling factor instead of scaling the vectorized loop variables. With this, we can remove all the code that explicitly scales the loop variables.
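
To illustrate change 2, here is a minimal hand-written sketch of the two loop shapes. It is not actual Inductor output: `kVecWidth`, `add_one_scaled_index`, and `add_one_vector_step` are hypothetical names, and the inner lane loop stands in for a single vector load/add/store.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical vector width (assumption for this sketch only).
constexpr std::size_t kVecWidth = 8;

// Old shape: the loop variable counts vector blocks, so every use of it
// has to be scaled by the vector width to get an element offset.
std::vector<float> add_one_scaled_index(const std::vector<float>& in) {
    std::vector<float> out(in.size());
    const std::size_t blocks = in.size() / kVecWidth;
    for (std::size_t i = 0; i < blocks; ++i) {
        for (std::size_t lane = 0; lane < kVecWidth; ++lane) {
            out[i * kVecWidth + lane] = in[i * kVecWidth + lane] + 1.0f;
        }
    }
    // Scalar tail for the elements that do not fill a whole vector.
    for (std::size_t i = blocks * kVecWidth; i < in.size(); ++i) {
        out[i] = in[i] + 1.0f;
    }
    return out;
}

// New shape: the loop step is the vector width, so the loop variable is
// already an element offset and no explicit scaling is needed.
std::vector<float> add_one_vector_step(const std::vector<float>& in) {
    std::vector<float> out(in.size());
    const std::size_t vec_end = in.size() / kVecWidth * kVecWidth;
    for (std::size_t i = 0; i < vec_end; i += kVecWidth) {
        for (std::size_t lane = 0; lane < kVecWidth; ++lane) {
            out[i + lane] = in[i + lane] + 1.0f;
        }
    }
    // Scalar tail starts where the vectorized loop stopped.
    for (std::size_t i = vec_end; i < in.size(); ++i) {
        out[i] = in[i] + 1.0f;
    }
    return out;
}
```

Both functions visit the same elements; the second form simply keeps the index arithmetic out of the loop body, which is the kind of scaling code this PR removes.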
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98135
Approved by: https://github.com/EikanWang, https://github.com/jansel