xla
Install torch wheel from pytorch to unblock TPU CI
#6691
Merged

vanbasten23 merged 7 commits into master from fixNewTpuCI
vanbasten23, 1 year ago (edited 1 year ago)

#6681 helped uncover silent failures in the TPU Integration Test / tpu-test (push) job. The failure is:

+ python3 test/test_operations.py -v
Traceback (most recent call last):
  File "test/test_operations.py", line 47, in <module>
    import torchvision
  File "/usr/local/lib/python3.8/site-packages/torchvision/__init__.py", line 6, in <module>
    from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils
  File "/usr/local/lib/python3.8/site-packages/torchvision/_meta_registrations.py", line 164, in <module>
    def meta_nms(dets, scores, iou_threshold):
  File "/home/runner/.local/lib/python3.8/site-packages/torch/library.py", line 467, in inner
    handle = entry.abstract_impl.register(func_to_register, source)
  File "/home/runner/.local/lib/python3.8/site-packages/torch/_library/abstract_impl.py", line 30, in register
    if torch._C._dispatch_has_kernel_for_dispatch_key(self.qualname, "Meta"):
RuntimeError: operator torchvision::nms does not exist

The failure occurs because the official torchvision wheel is not compatible with the torch wheel that we build ourselves.
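
A minimal sketch of how this kind of incompatibility could be surfaced early, before the test suite runs. It assumes only that a broken wheel pairing shows up as an exception at import time, as in the traceback above. The helper name `check_import` is hypothetical and is not part of torch, torchvision, or the CI scripts.

```python
import importlib


def check_import(module_name):
    """Try to import a module and report whether it succeeded.

    Returns (True, "") on success, or (False, "<ExceptionType>: <message>")
    if the import raises. The torchvision failure above surfaces as a
    RuntimeError rather than an ImportError, so both are caught.
    """
    try:
        importlib.import_module(module_name)
    except (ImportError, RuntimeError) as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, ""


if __name__ == "__main__":
    # Hypothetical usage: check the packages the TPU tests import.
    for name in ("torch", "torchvision"):
        ok, err = check_import(name)
        print(f"{name}: {'ok' if ok else err}")
```

Running a check like this as its own CI step would make a wheel mismatch fail fast with a short message instead of deep inside a test run.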

vanbasten23 requested a review from will-cromar 1 year ago
vanbasten23 requested a review from mbzomowski 1 year ago
vanbasten23, 1 year ago

Is there a way to test the change before the change is merged?

mbzomowski, 1 year ago (👍 1)

Go to Actions > TPU Integration Test > Run workflow, then select your branch; it will start the run from your edited workflow.

FYI I opened a PR for this: #6690, which failed: https://github.com/pytorch/xla/actions/runs/8193035259

vanbasten23 changed the title from "Reinstall torch wheel to unblock TPU CI" to "Build torchvision from source to unblock TPU CI" 1 year ago
vanbasten23, 1 year ago

Building torchvision from source works (https://github.com/pytorch/xla/actions/runs/8195031279): the nms error is gone, and we don't have to rely on nightly torch wheels, which would break us every time we make a companion change in pytorch. cc @will-cromar @mbzomowski

vanbasten23 force pushed from bcdc721f to 2d97ee56 1 year ago
vanbasten23, 1 year ago

Per offline discussion, compiling torchvision from source is less ideal, so it's better to stick with the original plan.

vanbasten23 changed the title from "Build torchvision from source to unblock TPU CI" to "Install torch wheel from pytorch to unblock TPU CI" 1 year ago
vanbasten23, 1 year ago

Current failure:

+ python3 test/test_operations.py -v
2024-03-08T19:03:57.5462724Z Traceback (most recent call last):
2024-03-08T19:03:57.5463798Z   File "test/test_operations.py", line 31, in <module>
2024-03-08T19:03:57.5464676Z     import torch_xla
2024-03-08T19:03:57.5467031Z   File "/home/runner/.local/lib/python3.8/site-packages/torch_xla-2.3.0+git177eb6e-py3.8-linux-x86_64.egg/torch_xla/__init__.py", line 7, in <module>
2024-03-08T19:03:57.5469219Z     import _XLAC
2024-03-08T19:03:57.5479640Z ImportError: /home/runner/.local/lib/python3.8/site-packages/torch_xla-2.3.0+git177eb6e-py3.8-linux-x86_64.egg/_XLAC.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN5torch4lazy13MetricFnValueB5cxx11Ed
2024-03-08T19:03:57.8749327Z ##[error]Process completed with exit code 1.
vanbasten23, 1 year ago (edited 1 year ago)

The torch wheel is built with the C++11 ABI disabled. I think we enabled it when building torch_xla.
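
The undefined symbol in the log above hints at this mismatch: GCC mangles the `[abi:cxx11]` tag as `B5cxx11`, and that tag appears in `_ZN5torch4lazy13MetricFnValueB5cxx11Ed`. This suggests `_XLAC` was compiled expecting the C++11 ABI variant of the symbol, which a torch wheel built with `_GLIBCXX_USE_CXX11_ABI=0` would not export. A minimal sketch of that check follows; the helper name `has_cxx11_abi_tag` is hypothetical, and the substring match is only a heuristic.

```python
CXX11_ABI_TAG = "B5cxx11"  # Itanium mangling of GCC's [abi:cxx11] tag


def has_cxx11_abi_tag(mangled_symbol):
    """Return True if a mangled C++ symbol carries the [abi:cxx11] tag.

    This is a plain substring check, so it is a heuristic rather than
    a real demangler.
    """
    return CXX11_ABI_TAG in mangled_symbol


if __name__ == "__main__":
    # Symbol copied from the ImportError in the log above.
    missing = "_ZN5torch4lazy13MetricFnValueB5cxx11Ed"
    print(has_cxx11_abi_tag(missing))
```

When torch itself is importable, `torch.compiled_with_cxx11_abi()` reports which ABI the installed wheel was built with, which can be compared against the setting used to build torch_xla.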

vanbasten23 force pushed from bc34f282 to 3c3d298b 1 year ago
vanbasten23, 1 year ago

OK, some progress: we can import torch_xla now. It now fails with:

======================================================================
FAIL: test_resnet18 (__main__.DynamoTrainingOptimizerTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/dynamo/test_dynamo.py", line 518, in test_resnet18
    self.assertEqual(met.metric_data('CompileTime')[0], 6)
AssertionError: 7 != 6

This failure appears after I rebased onto origin/master, because the companion upstream PR pytorch/pytorch#115621 was just reverted. All other tests should be good.

vanbasten23 reinstall torch wheel to unblock TPU CI
bef203d8
vanbasten23 uninstall torch
25e34790
vanbasten23 update script
faf1a053
vanbasten23 try again with --user to install and don't uninstall torch/torchvision
b4a0620d
vanbasten23 another fix
f494ce26
vanbasten23 disable abi
7d1d4e81
vanbasten23 force pushed from 3c3d298b to 7d1d4e81 1 year ago
will-cromar commented on 2024-03-08
Conversation is marked as resolved
.github/workflows/tpu_ci.yml
    env:
      PJRT_DEVICE: TPU
    run: |
+     pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121 --user
will-cromar, 1 year ago

I know the error originates from the other file, but we don't need torchaudio, and there is a CPU index: https://download.pytorch.org/whl/nightly/cpu

We don't need to install the CUDA versions on TPU.
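
A minimal sketch of the suggestion above: drop torchaudio and point pip at the CPU nightly index instead of cu121. The index URL and the `--pre` and `--user` flags come from this thread; the helper name `nightly_install_cmd` and the idea of building the command in Python are assumptions for illustration, not what the workflow file does.

```python
import sys

NIGHTLY_CPU_INDEX = "https://download.pytorch.org/whl/nightly/cpu"


def nightly_install_cmd(packages=("torch", "torchvision"), index_url=NIGHTLY_CPU_INDEX):
    """Build a pip command that installs nightly wheels from the given index."""
    return [
        sys.executable, "-m", "pip", "install", "--pre",
        *packages,
        "--index-url", index_url,
        "--user",
    ]


if __name__ == "__main__":
    # Print the command rather than running it, so the sketch has no side effects.
    print(" ".join(nightly_install_cmd()))
```

The resulting list can be passed to `subprocess.run(..., check=True)` if the install should actually be performed.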

vanbasten23, 1 year ago

Fixed.

will-cromar commented on 2024-03-08
Conversation is marked as resolved
.github/workflows/tpu_ci.yml
    steps:
      - name: Checkout and Setup PyTorch Repo
+       env:
+         _GLIBCXX_USE_CXX11_ABI: 0
will-cromar, 1 year ago (👍 1)

The C++11 ABI switch has been such an unnecessary headache. I don't know why the default for building torch doesn't match their published wheels...

will-cromar approved these changes on 2024-03-08

This lgtm as a temporary fix if it's working. Please update to the CPU wheels before merging.

I'm interested to see if #6704 works as a more stable solution.

vanbasten23, 1 year ago

> This lgtm as a temporary fix if it's working. Please update to the CPU wheels before merging.

Why CPU wheels? The old TPU CI uses the CUDA wheel, and our GitHub index page also suggests installing the CUDA torch wheel. So shouldn't our CI be consistent?

will-cromar, 1 year ago

The CPU wheel is much smaller and will download more quickly. The other TPU CI should also be using CPU wheels. Our release builds will get tested against the final upstream CUDA wheel.

vanbasten23, 1 year ago

> The CPU wheel is much smaller and will download more quickly. The other TPU CI should also be using CPU wheels. Our release builds will get tested against the final upstream CUDA wheel.

If our GitHub index page tells users to install the CUDA torch wheel, should we do the same? Or does it not matter much?

will-cromar, 1 year ago

> If our GitHub index page tells users to install the CUDA torch wheel, should we do the same? Or does it not matter much?

These are just nightly builds, so in my mind it doesn't matter.

vanbasten23 fix up
49254f12
vanbasten23, 1 year ago

Waiting on a new TPU CI run. Stay tuned.

mbzomowski, 1 year ago (👍 1)

> Pending on a new TPU CI run. Stay tuned..

https://github.com/pytorch/xla/actions/runs/8240857381

Looks good

mbzomowski approved these changes on 2024-03-11
vanbasten23, 1 year ago

The new TPU CI passed. Thanks for the review.

vanbasten23 merged 6630287c into master 1 year ago
