PR #6691 Install torch wheel from pytorch to unblock TPU CI

vanbasten231 year ago (edited 1 year ago)

#6681 helps uncover the silent failures in the TPU Integration Test / tpu-test (push). The failure is

+ python3 test/test_operations.py -v
Traceback (most recent call last):
  File "test/test_operations.py", line 47, in <module>
    import torchvision
  File "/usr/local/lib/python3.8/site-packages/torchvision/__init__.py", line 6, in <module>
    from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils
  File "/usr/local/lib/python3.8/site-packages/torchvision/_meta_registrations.py", line 164, in <module>
    def meta_nms(dets, scores, iou_threshold):
  File "/home/runner/.local/lib/python3.8/site-packages/torch/library.py", line [46](https://github.com/pytorch/xla/actions/runs/8180398097/job/22368345687#step:5:47)7, in inner
    handle = entry.abstract_impl.register(func_to_register, source)
  File "/home/runner/.local/lib/python3.8/site-packages/torch/_library/abstract_impl.py", line 30, in register
    if torch._C._dispatch_has_kernel_for_dispatch_key(self.qualname, "Meta"):
RuntimeError: operator torchvision::nms does not exist

The failure is due to the torch wheel built by us and torchvison official wheel is not compatible with the torch wheel built by us.

vanbasten23 requested a review from

will-cromar 1 year ago

vanbasten23 requested a review from

mbzomowski 1 year ago

vanbasten231 year ago

Is there a way to test the change before the change is merged?

mbzomowski1 year ago👍 1

Actions > TPU Integration Test > Run workflow > then select your branch, and it'll start the run from your edited workflow

FYI I opened a PR for this: #6690, which failed: https://github.com/pytorch/xla/actions/runs/8193035259

vanbasten23 changed the title ~~Reinstall torch wheel to unblock TPU CI~~ Build torchvision from source to unblock TPU CI 1 year ago

vanbasten231 year ago

Building torchvision from src is working https://github.com/pytorch/xla/actions/runs/8195031279: we don't have the the nms error and we don't have to rely on nightly torch wheels so that we have to break every time we make companion change in pytorch. cc @will-cromar @mbzomowski

vanbasten23 force pushed from bcdc721f to 2d97ee56 1 year ago

vanbasten231 year ago

Per offline discussion, it's less ideal to compile torchvision from src and it's better to stick with the original plan

vanbasten23 changed the title ~~Build torchvision from source to unblock TPU CI~~ Install torch wheel from pytorch to unblock TPU CI 1 year ago

vanbasten231 year ago

Current failure:

+ python3 test/test_operations.py -v
2024-03-08T19:03:57.5462724Z Traceback (most recent call last):
2024-03-08T19:03:57.5463798Z   File "test/test_operations.py", line 31, in <module>
2024-03-08T19:03:57.5464676Z     import torch_xla
2024-03-08T19:03:57.5467031Z   File "/home/runner/.local/lib/python3.8/site-packages/torch_xla-2.3.0+git177eb6e-py3.8-linux-x86_64.egg/torch_xla/__init__.py", line 7, in <module>
2024-03-08T19:03:57.5469219Z     import _XLAC
2024-03-08T19:03:57.5479640Z ImportError: /home/runner/.local/lib/python3.8/site-packages/torch_xla-2.3.0+git177eb6e-py3.8-linux-x86_64.egg/_XLAC.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN5torch4lazy13MetricFnValueB5cxx11Ed
2024-03-08T19:03:57.8749327Z ##[error]Process completed with exit code 1.

vanbasten231 year ago (edited 1 year ago)

torch has abi disabled. I think we enabled it when building torch_xla

vanbasten23 force pushed from bc34f282 to 3c3d298b 1 year ago

vanbasten231 year ago

Ok, some progress, we can import torch_xla atm. Now, it fails with:

======================================================================
FAIL: test_resnet18 (__main__.DynamoTrainingOptimizerTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/dynamo/test_dynamo.py", line 518, in test_resnet18
    self.assertEqual(met.metric_data('CompileTime')[0], 6)
AssertionError: 7 != 6

after I rebase onto origin/master. It's because the companion upstream pr pytorch/pytorch#115621 was just reverted. But all others should be good.

reinstall torch wheel to unblock TPU CI

bef203d8

uninstall torch

25e34790

update script

faf1a053

try again with --user to install and don't uninstall torch/torchvision

b4a0620d

another fix

f494ce26

disable abi

7d1d4e81

vanbasten23 force pushed from 3c3d298b to 7d1d4e81 1 year ago

will-cromar commented on 2024-03-08

Conversation is marked as resolved

Show resolved

will-cromar commented on 2024-03-08

Conversation is marked as resolved

Show resolved

will-cromar approved these changes on 2024-03-08

will-cromar1 year ago

This lgtm as a temporary fix if it's working. Please update to the CPU wheels before merging.

I'm interested to see if #6704 works as a more stable solution.

vanbasten231 year ago

This lgtm as a temporary fix if it's working. Please update to the CPU wheels before merging.

Why CPU wheels? The old TPU CI uses the cuda wheel and our github index page also suggest to install the cuda torch wheel. So shouldn't our CI be consistent?

will-cromar1 year ago

The CPU wheel is much smaller and will download more quickly. The other TPU CI should also be using CPU wheels. Our release builds will get tested against the final upstream CUDA wheel.

vanbasten231 year ago

The CPU wheel is much smaller and will download more quickly. The other TPU CI should also be using CPU wheels. Our release builds will get tested against the final upstream CUDA wheel.

If our github index page suggests our users to use cuda torch wheel, should we do the same? Or it doesn't matter much?

will-cromar1 year ago

If our github index page suggests our users to use cuda torch wheel, should we do the same? Or it doesn't matter much?

These are just nightly builds, so in my mind it doesn't matter.

fix up

49254f12

vanbasten231 year ago

Pending on a new TPU CI run. Stay tuned..

mbzomowski1 year ago👍 1

Pending on a new TPU CI run. Stay tuned..

https://github.com/pytorch/xla/actions/runs/8240857381

Looks good

mbzomowski approved these changes on 2024-03-11

vanbasten231 year ago

The new TPU CI passed. Thanks for the review.

vanbasten23 merged 6630287c into master 1 year ago

xla
Install torch wheel from pytorch to unblock TPU CI
#6691

Merged

Install torch wheel from pytorch to unblock TPU CI #6691

30	32	env:
31	33	PJRT_DEVICE: TPU
32	34	run: \|
	35	pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121 --user

11	11	steps:
12	12	- name: Checkout and Setup PyTorch Repo
	13	env:
	14	_GLIBCXX_USE_CXX11_ABI: 0

xla Install torch wheel from pytorch to unblock TPU CI #6691 Merged

Install torch wheel from pytorch to unblock TPU CI #6691

xla
Install torch wheel from pytorch to unblock TPU CI
#6691

Merged