Use cascade summation in nll_loss on CPU (#55841)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55657
This also skips summing `total_weight_val` when weights aren't supplied, which avoids accumulated error completely in that case.
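
For context, a minimal sketch of why cascade (pairwise) summation helps. This is an illustration in plain Python, not the ATen CPU kernel changed by this PR; `naive_sum`, `pairwise_sum`, and the `block` size are hypothetical names chosen for the example. Splitting the input in half recursively keeps the partial sums similar in magnitude, so rounding error grows roughly with log(n) rather than n.

```python
import math


def naive_sum(values):
    # Sequential accumulation: each add rounds against an ever-growing total.
    total = 0.0
    for v in values:
        total += v
    return total


def pairwise_sum(values, lo=0, hi=None, block=8):
    # Cascade summation: sum each half recursively, then add the two halves.
    # `block` (hypothetical tuning knob) is the size below which we fall back
    # to a plain loop.
    if hi is None:
        hi = len(values)
    if hi - lo <= block:
        return naive_sum(values[lo:hi])
    mid = lo + (hi - lo) // 2
    return pairwise_sum(values, lo, mid, block) + pairwise_sum(values, mid, hi, block)


values = [0.1] * 1_000_000
exact = math.fsum(values)  # correctly rounded reference sum
naive_err = abs(naive_sum(values) - exact)
pairwise_err = abs(pairwise_sum(values) - exact)
print(f"naive error:    {naive_err:.3e}")
print(f"pairwise error: {pairwise_err:.3e}")
```

The same effect is much more pronounced in float32, which is the common dtype for loss reductions.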
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55841
Reviewed By: jbschlosser
Differential Revision: D27751492
Pulled By: ngimel
fbshipit-source-id: 2c2dc48f31c25dfa9db48693e3f765b179771a3c