DeepSpeed
b9a5e995 - Fix autoTP test numerical tolerance with assert_close

Commit
26 days ago
Fix autoTP test numerical tolerance with assert_close

Replace torch.allclose() with torch.testing.assert_close() and add rtol parameter for proper floating-point comparisons in testRowParallel and testColumnParallel tests.

The tests were failing intermittently in CI because they only used absolute tolerance (atol=1e-2) without relative tolerance. Adding rtol=1e-2 allows for proper numerical comparisons where value magnitudes vary.

Also restore normal workflow execution (remove debug steps).

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
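
To illustrate why an absolute tolerance alone can be too strict when value magnitudes vary, here is a minimal pure-Python sketch of the closeness check documented for torch.testing.assert_close, |actual - expected| <= atol + rtol * |expected|. The helper name within_tolerance and the sample values are hypothetical and not taken from the DeepSpeed tests; rtol=0.0 stands in for a comparison with no relative tolerance.

```python
def within_tolerance(actual, expected, rtol, atol):
    """Return True if |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Hypothetical values: a large-magnitude output with a 0.5% relative error.
expected = 100.0
actual = 100.5

# Absolute tolerance only: the allowed error stays at 0.01 regardless of magnitude.
abs_only = within_tolerance(actual, expected, rtol=0.0, atol=1e-2)

# Absolute plus relative tolerance: the allowed error grows with |expected|.
abs_and_rel = within_tolerance(actual, expected, rtol=1e-2, atol=1e-2)

print("atol only:", abs_only)
print("atol + rtol:", abs_and_rel)
```

With these sample values the absolute-only check rejects the pair, while the combined check accepts it, which matches the reasoning given in the commit message for adding rtol=1e-2.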