fix accuracy failure for beit_base_patch16_224 during training (#130005)
Summary:
This model's accuracy test recently regressed. The debugging process for finding the cause was quite smooth, so I'd like to write it down in case it is helpful to others.
Clicking the model name beit_base_patch16_224 on the dashboard shows the model's pass/fail status over, e.g., the past month. For this model, we can see that it started failing on June 08:
<img width="1118" alt="Screenshot 2024-07-02 at 5 36 35 PM" src="https://github.com/pytorch/pytorch/assets/52589240/32f27ccd-3ec7-4431-88b3-febeff831f8e">
What's nice is that the dashboard also shows the nightly commit for each run.
Running
```
git log --oneline a448b3ae9537c0ae233fb9199a4a221fdffbb..0e6c204642a571d5a7cd60be0caeb9b50faca030 torch/_inductor/
```
gives us the list of Inductor PRs between the good and the bad commit: https://gist.github.com/shunting314/eb57965688fc9e1746fcfa9b7b6b02df
Skimming through the PRs, I suspected that
```
ffc202a1b91 Added remove_noop_ops to joint_graph_passes (#124451)
```
could change numerics, so I disabled it locally with this one-line change: https://gist.github.com/shunting314/13aec768bda986056d0fb40dce53396e . With that change, the accuracy test passes. (Command: `time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only beit_base_patch16_224`)
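The gist above is the authoritative change. As a rough in-process alternative, one could also monkey-patch the pass to a no-op before compiling; this is only a sketch, and it assumes `remove_noop_ops` is reachable as an attribute of `torch._inductor.fx_passes.joint_graph`, where `joint_graph_passes` calls it:

```python
# Sketch only: neutralize the suspected pass for this process, then compile.
# Assumes torch._inductor.fx_passes.joint_graph exposes remove_noop_ops; the
# actual experiment used the one-line source change in the gist above.
import torch
import torch._inductor.fx_passes.joint_graph as joint_graph

joint_graph.remove_noop_ops = lambda graph: None  # the pass becomes a no-op

@torch.compile(backend="inductor")
def f(x):
    return torch.relu(x + 1)

print(f(torch.randn(8)))  # compiled without joint-graph no-op removal
```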
Horace's PR (https://github.com/pytorch/pytorch/pull/124451) itself is valid: it removes no-op ops from the joint graph. But changing the graph can lead the partitioner to make different recomputation decisions, which in turn can change numerics slightly.
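This kind of drift is easy to reproduce outside the model. A minimal illustration (unrelated to the model's actual graph): evaluating the same math with different intermediate roundings, as can happen when recomputation decisions change under AMP, gives slightly different results that still agree under a looser tolerance:

```python
import torch

torch.manual_seed(0)
x = torch.randn(10_000).to(torch.float16)

# The same mathematical sum over the same fp16 inputs, computed two ways:
#   a) upcast once and accumulate in float32
#   b) reduce in chunks, each partial sum rounded back to float16 first
a = x.float().sum()
b = x.view(100, 100).sum(dim=1).float().sum()

print((a - b).abs().item())                         # small but (almost always) nonzero drift
print(torch.allclose(a, b, rtol=1e-2, atol=1e-2))   # well within a loosened tolerance
```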
Since this is not a real issue, I'll raise the tolerance to make it pass.
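For illustration only (the names below are hypothetical, not the benchmark runner's actual API), the shape of the fix is a per-model tolerance override so that a known-benign numeric drift no longer fails the accuracy check:

```python
import torch

HIGHER_TOLERANCE_MODELS = {"beit_base_patch16_224"}  # hypothetical registry

def passes_accuracy(name, eager_out, compiled_out, rtol=1e-3, atol=1e-3):
    # Hypothetical check: loosen the tolerance only for models with known-benign drift.
    if name in HIGHER_TOLERANCE_MODELS:
        rtol, atol = 1e-2, 1e-2
    return torch.allclose(eager_out, compiled_out, rtol=rtol, atol=atol)

# A small drift that fails the default tolerance but passes the raised one.
a = torch.tensor([1.000, 2.000])
b = torch.tensor([1.004, 2.000])
print(passes_accuracy("some_other_model", a, b))        # False at the default tolerance
print(passes_accuracy("beit_base_patch16_224", a, b))   # True with the raised tolerance
```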
X-link: https://github.com/pytorch/pytorch/pull/130005
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #129996, #129941
Reviewed By: kit1980
Differential Revision: D59413523
Pulled By: shunting314
fbshipit-source-id: d4d678b000bf497d1f48a3c74032bbd4d08aa5ac