[forward-fix] Fix multigpu varying tensor optim tests (#106887)
Forward fixes https://github.com/pytorch/pytorch/pull/106615 by increasing tolerance in the test.
The capturable foreach implementation produces slightly different results because it updates params with a different order of operations. I also tried comparing against fp64, but that introduced more disparity in the other optimizer configs. The fp64 comparison is worth revisiting later, but let's get the test passing first.
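For context, a minimal sketch of why reordering operations alone can break an exact comparison while still passing a tolerance check. This is plain Python, not the test or optimizer code from this PR; the values and the tolerance are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical values, picked only to expose floating-point rounding.
a, b, c = 0.1, 0.2, 0.3

# Mathematically identical sums, evaluated in a different order.
left = (a + b) + c
right = a + (b + c)

# Exact equality fails because float addition is not associative.
exactly_equal = left == right

# A tolerance-based comparison (illustrative rel_tol) still passes.
close_enough = math.isclose(left, right, rel_tol=1e-9)

print(exactly_equal, close_enough)
```

The same effect, accumulated over many param updates, is the reason a looser tolerance can be appropriate when two implementations differ only in operation order.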
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106887
Approved by: https://github.com/izaitsevfb