Tensor-parallel: Fix delayed AllReduce on Gemma-4 MoE (#22129)
* Fix delayed AllReduce on Gemma-4 MoE
Skip forward past nodes that don't consume the current one, and allow a chain of MULs.
* Check for all sources before skipping nodes
* Address review comments