Is there a way to test the change before the change is merged?
Actions > TPU Integration Test > Run workflow, then select your branch, and it'll start the run from your edited workflow.
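For reference, the same workflow_dispatch run can also be kicked off via the GitHub REST API instead of the Actions UI. This is only a sketch, not something from this thread: the workflow file name `tpu_ci.yml` and the branch name are placeholders, and it assumes a token with the `workflow` scope in `GITHUB_TOKEN`.

```python
# Hypothetical sketch: trigger the TPU Integration Test workflow on a branch via
# the GitHub REST API (POST .../actions/workflows/{workflow_file}/dispatches).
import os
import requests

WORKFLOW_FILE = "tpu_ci.yml"  # placeholder: substitute the actual workflow file name
BRANCH = "my-feature-branch"  # the branch containing the edited workflow

resp = requests.post(
    f"https://api.github.com/repos/pytorch/xla/actions/workflows/{WORKFLOW_FILE}/dispatches",
    headers={
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github+json",
    },
    json={"ref": BRANCH},
)
resp.raise_for_status()  # GitHub returns 204 No Content on success
```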
FYI I opened a PR for this: #6690, which failed: https://github.com/pytorch/xla/actions/runs/8193035259
Building torchvision from source is working https://github.com/pytorch/xla/actions/runs/8195031279: we don't have the nms error, and we don't have to rely on nightly torch wheels, which break every time we make a companion change in pytorch. cc @will-cromar @mbzomowski
Per offline discussion, it's less ideal to compile torchvision from source; it's better to stick with the original plan.
Current failure:
+ python3 test/test_operations.py -v
2024-03-08T19:03:57.5462724Z Traceback (most recent call last):
2024-03-08T19:03:57.5463798Z File "test/test_operations.py", line 31, in <module>
2024-03-08T19:03:57.5464676Z import torch_xla
2024-03-08T19:03:57.5467031Z File "/home/runner/.local/lib/python3.8/site-packages/torch_xla-2.3.0+git177eb6e-py3.8-linux-x86_64.egg/torch_xla/__init__.py", line 7, in <module>
2024-03-08T19:03:57.5469219Z import _XLAC
2024-03-08T19:03:57.5479640Z ImportError: /home/runner/.local/lib/python3.8/site-packages/torch_xla-2.3.0+git177eb6e-py3.8-linux-x86_64.egg/_XLAC.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN5torch4lazy13MetricFnValueB5cxx11Ed
2024-03-08T19:03:57.8749327Z ##[error]Process completed with exit code 1.
torch has the C++11 ABI disabled. I think we enabled it when building torch_xla.
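The `B5cxx11` tag in the undefined symbol is the `[abi:cxx11]` marker, i.e. `_XLAC.so` expects symbols from a libtorch built with the new C++11 ABI. As a quick sanity check, a sketch (assumes only that torch is importable) for seeing which ABI the installed torch wheel was built with:

```python
# Sketch: confirm which C++ ABI the installed torch wheel was built with.
# False means -D_GLIBCXX_USE_CXX11_ABI=0 (the pre-cxx11 ABI), which would not
# provide the [abi:cxx11] symbols that _XLAC.so is asking for.
import torch

print("torch version:", torch.__version__)
print("built with cxx11 ABI:", torch.compiled_with_cxx11_abi())
```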
OK, some progress: we can import torch_xla now. However, it fails with:
======================================================================
FAIL: test_resnet18 (__main__.DynamoTrainingOptimizerTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test/dynamo/test_dynamo.py", line 518, in test_resnet18
self.assertEqual(met.metric_data('CompileTime')[0], 6)
AssertionError: 7 != 6
after I rebased onto origin/master. It's because the companion upstream PR pytorch/pytorch#115621 was just reverted. All the others should be good.
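For anyone reproducing the CompileTime count locally, here is a minimal sketch (assuming a working torch_xla install with an available XLA device; not part of the test suite) of how the metric behind the failing assertion is read:

```python
# Sketch: read the CompileTime metric that test_dynamo.py asserts on.
import torch
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met

device = xm.xla_device()
x = torch.randn(4, 4, device=device)
y = (x @ x).sum()
xm.mark_step()  # force the pending graph to compile and execute

# metric_data(name) returns (count, accumulated value, samples); the test
# compares the count against the expected number of compilations.
print("CompileTime count:", met.metric_data('CompileTime')[0])
```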
This lgtm as a temporary fix if it's working. Please update to the CPU wheels before merging.
I'm interested to see if #6704 works as a more stable solution.
> This lgtm as a temporary fix if it's working. Please update to the CPU wheels before merging.
Why CPU wheels? The old TPU CI uses the CUDA wheel, and our GitHub index page also suggests installing the CUDA torch wheel. So shouldn't our CI be consistent?
The CPU wheel is much smaller and will download more quickly. The other TPU CI should also be using CPU wheels. Our release builds will get tested against the final upstream CUDA wheel.
> The CPU wheel is much smaller and will download more quickly. The other TPU CI should also be using CPU wheels. Our release builds will get tested against the final upstream CUDA wheel.
If our GitHub index page suggests that users install the CUDA torch wheel, should our CI do the same? Or does it not matter much?
> If our GitHub index page suggests that users install the CUDA torch wheel, should our CI do the same? Or does it not matter much?
These are just nightly builds, so in my mind it doesn't matter.
Pending a new TPU CI run. Stay tuned.
> Pending a new TPU CI run. Stay tuned.
https://github.com/pytorch/xla/actions/runs/8240857381
Looks good
The new TPU CI passed. Thanks for the review.
#6681 helps uncover the silent failures in the TPU Integration Test / tpu-test (push) workflow. The failure is due to the official torchvision wheel not being compatible with the torch wheel built by us.
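As a quick way to surface that incompatibility before the full test run, something like the following sketch can be used (it assumes only that torch and torchvision are installed): when torchvision was built against a different torch, calling a compiled operator such as nms fails immediately with an undefined-symbol / missing-operator error instead of passing silently.

```python
# Sketch: smoke-test that the installed torchvision matches the installed torch
# by exercising a compiled C++ operator (nms) rather than pure-Python code.
import torch
import torchvision
from torchvision.ops import nms

print("torch:", torch.__version__, "torchvision:", torchvision.__version__)

boxes = torch.tensor([[0.0, 0.0, 10.0, 10.0], [1.0, 1.0, 11.0, 11.0]])
scores = torch.tensor([0.9, 0.8])
print("kept boxes:", nms(boxes, scores, iou_threshold=0.5))
```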