xla
Cherry-pick 2.1 release branch into XRT branch through 9/14
#5574
Merged

will-cromar merged 117 commits into xrt from wcromar/xrt-cherry-picks-9-14
JackCaoG Sharding should be per output of IR Node, instead of per IR Node (#5330)
d638f6e4
JackCaoG Update Python device API for SPMD (#5129)
ee2e6cf1
vanbasten23 Check out the release branch instead of origin/master in ansible (#5344)
6ef52a87
JackCaoG Also dump output sharding on HLO file (#5339)
9476d08b
will-cromar Make all-reduce a no-op when world size is 1 (#5342)
b23a3337
lsy323 add fs linker flag (#5347)
f203e938
lsy323 Add py3.10 whl path to doc, refactor whl table (#5354)
10cea648
baoleai fix amp dtype setting for GPU (#5337)
5257b5fd
JackCaoG Add python test for SPMD+Runtime Python API (#5349)
b07770d5
JackCaoG Check the actual device instead of query env var for virtual device (…
ab01165e
janeyx99 [BE] use self.assertEquals instead of str equality in test_zero1.py (…
bde253d0
JackCaoG Revert "[BE] use self.assertEquals instead of str equality in test_ze…
f172dcd5
ManfeiBai [Dynamo|TPU] Tweak `atol` and `rtol` for `test_dynamo.py` (#5363)
55491630
ManfeiBai [Dynamo|TPU] Skip`DynamoTrainingBasicTest.test_resnet18` on TPU (#5362)
c49daaf5
qihqi Add a script for running stablehlo tests. (#5360)
e7ae1d46
jonb377 Don't rewrite index hints in global save planning (#5348)
f993041e
ManfeiBai [Dynamo|TPU] Skip `DynamoInferenceBasicTest.test_resnet18` on TPU (#5…
c7d7b234
janeyx99 [BE] use self.assertEquals instead of str equality in test_zero1.py (…
3fec0494
JackCaoG Fix ReplicateShardedData for int type (#5374)
7df484cf
wonjoo-wj Update dynamo.md (#5378)
52ab1c69
JackCaoG Revert "Fix ReplicateShardedData for int type (#5374)" (#5380)
9ae2efe5
JackCaoG Remove the mention of XRT_TPU_CONFIG in the CONTRIBUTING.md (#5379)
84a6635d
ManfeiBai [Dynamo|TPU] Tweak `atol` and `rtol` for `test_simple_model_with_diff…
70b09d55
janeyx99 Rectify test_zero1.py once optim.load_state_dict doesn't guarantee im…
d2f82215
vanbasten23 Add gpu doc for how to build PyTorch/XLA from source with GPU support…
3ff13bf2
JackCaoG clear pending ir should also clear the cc op tokens (#5385)
6c2d7af5
jonb377 Port resnet data loading optimizations to SPMD test script (#5386)
87d397d0
wonjoo-wj Add support for in-place ops with self tensors in dynamo bridge (#5309)
59124a8d
ManfeiBai Add dynamo test in TPU CI (#5381)
ab682142
jonb377 Add manual seed in multihost checkpoint (#5392)
b13d1f2f
lsy323 Fix change_id type in coverage uploading (#5394)
48f7f551
wonjoo-wj Update dynamo cpu fallback op to aten::_foobar (#5393)
7a80658a
vanbasten23 Run single host multi GPU tests in the CI. (#5387)
a4a742de
will-cromar [PJRT] Separate collective ops test from TPU runtime test. (#5396)
1d99226e
JackCaoG Fix ReplicateShardedData for int type (#5404)
33500a50
JackCaoG Update the dynamo backend name to `openxla` (#5402)
009f6747
khatwanimohit [SPMD] Multi-host batch sharded data loading (#5331)
f786ddfa
qihqi Refactor to share code between export_torch_model and save_as_stableh…
2b2251f9
will-cromar Fix TPU collective ops test for multi-host TPUs (#5408)
26391c17
jonb377 Partially replicate lower-rank tensors (#5409)
88a68651
yeounoh Revert "Partially replicate lower-rank tensors (#5409)" (#5412)
239119ff
yeounoh SPMD cross slice-replication using partial_replication sharding (#5411)
fa13eb35
JackCaoG Fix the incorect clone arg condition in dynamo bridge (#5414)
3ffa11e9
yeounoh [SPMD] named partition spec support (#5415)
ca3c0c45
ManfeiBai [PJRT|TPU] Update `test_xla_devices_single_process_all_chips` for exp…
7c831ee6
will-cromar changed the base branch from master to xrt 2 years ago
mateuszlewko Add repo for libcudnn8=8.7.0.84 and CUDA 11.8 (#5425)
16bc6283
aazzolini Update fix_includes.sh (#5441)
7bfff692
will-cromar [PJRT] Support `torchrun` with `pjrt://` `init_method` (#5438)
85150e53
qihqi Bugfix + add more test for llama (#5439)
b7bb55b3
JackCaoG Move the C++ test build to CI build job instead of test job (#5442)
6cf0446d
qihqi Update gcc to 10. (#5445)
60f5b0a9
JackCaoG Update the random seed for every dynamo execution (#5444)
55adfd10
lsy323 Revert "Update gcc to 10. (#5445)" (#5449)
c664b9fa
qihqi Install gcc-10 (#5450)
143dd079
will-cromar Revert "Install gcc-10 (#5450)" (#5452)
d1377979
JackCaoG parallelize SPMD inputhandler and GetDataShards (#5447)
f4890f22
will-cromar Remove base image override from TPU CI build (#5453)
023d3adb
will-cromar Update to GCC 10 (#5451)
22595dc1
JackCaoG Cache sharded placeholder for dynamo execution (#5446)
a1ab79c4
will-cromar Remove Docker image override from dev image (#5456)
f8ebe054
will-cromar hack: implement (unimplement?) GetDataShard for XRT
89a27e6e
JackCaoG skip flaky test (#5459)
8e60236d
aws-kingrj Neuron import hook (#5429)
3bc6d427
peterbell10 Add missing includes (#5434)
c8650949
ManfeiBai [GPU]Update README.md with wheel/docker for CUDA12.0 and deprecate CU…
bacc9de5
lsy323 update remote cache key in ansible (#5463)
4ec5835a
lsy323 Fix data type in Pow with Scalar base and Tensor exponent (#5467)
edc3d615
JackCaoG bump the timeout for CI (#5470)
0136410c
JackCaoG Fix the input sharding for dynamo (#5469)
455a6e17
JackCaoG Enabling sharding device data IR (#5475)
0d741e3a
yeounoh Introduce `torch_xla.runtime.use_spmd()` (#5474)
a1ea65f8
aws-kingrj Enable PJRT C API Client and other changes for Neuron (#5428)
e174a2a7
Don't move full tensor to device in deferred_init (#4819)
b7756b63
alanwaketan [SPMD] Fix HybridMesh ordering (#5478)
5b6c284e
alanwaketan [SPMD] Properly skip tests on TPU V2 (#5479)
b7e9cb88
yeounoh Add @yeounoh to .github CODEOWNERS (#5482)
b42b47d9
lsy323 Add Python API to execute StableHLO bytecode (#5476)
c3ff9de1
alanwaketan [SPMD] Fix TPU CI after #5478 (#5487)
8e4b5434
alanwaketan [SPMD] Fix XLA_DUMP_POST_OPTIMIZATIONS test (#5485)
2bb5ff2c
hgt312 [Dist] Refactor ZeRO-1 (#5145)
9879617c
wonjoo-wj Update artifacts.auto.tfvars for 2.1 release (#5483)
bc098180
JackCaoG Add ShardingSpec to XLATensor when it is created with a PJRTShardedDa…
a8f2a266
wonjoo-wj Add topological sorting to dynamo partitions (#5472)
72f69d8d
alanwaketan [SPMD] Patch nn.Linear (#5491)
42cadd25
aws-kingrj [original author: mrnikwaws] Neuron operator support (#5471)
6378fca6
alanwaketan [SPMD] Make IR sharding custom sharding op (#5433)
80975f55
JackCaoG Support input sharding changed after first dynamo tracing (#5477)
b559584e
jonb377 Always use ExecuteReplicated with SPMD (#5494)
b9c69550
JackCaoG Skip a couple tests on TPU due to precision issue (#5496)
673fa1e4
qihqi Refactor stablehlo API and put them in official location. (#5493)
717f1d48
jonb377 Support tuples in partition spec (#5488)
3aec1959
JackCaoG Add a API to explictly init runtime (#5500)
4543cb09
JackCaoG Add explict error message when tensor is on CPU for dynamo backend (#…
4964915d
lsy323 remove torchvision in stablehlo.py (#5501)
25bda8db
jonb377 Fix tupled partition spec test on v3 (#5503)
fd956753
JackCaoG Update dynamo doc (#5506)
0495761b
shauheen Update dynamo.md (#5509)
08a3bbc7
qihqi Get original_traced_args as example_inputs. (#5511)
a85b860c
yeounoh mark_sharding over a replicated tensor is allowed. (#5513)
ff94c942
vanbasten23 Disable cxx abi in ansible when building pt/xla for branch r2.0 (#5332)
73873f1e
wonjoo-wj Update pytorch git tag for r2.1 (#5529)
9c623571
wonjoo-wj Enable megacore_dense by default (#5520) (#5531)
2d7e92f2
will-cromar Add option to unbundle libtpu (#5534) (#5536)
ff30d64c
wonjoo-wj Revert 2.1 terraform changes (#5537)
19d74d4c
wonjoo-wj Fix FSDP for Models with Frozen Weights (#5484) (#5539)
7e225099
will-cromar Update r2.1 wheel to be compatible with PyPI (#5550)
cdcc3d8a
wonjoo-wj Add resnet50-weight-quant colab notebook (#5407) (#5556)
565a915b
will-cromar hack: add placeholders for `HasSharding` and `GetSharding` to XRT
60c9ad02
will-cromar formatting
e324b7be
will-cromar hack: always return false from `HasSharding`
5f1e8b38
will-cromar Update torch pin to current RC for CI testing
41f8276c
will-cromar Cherry pick `pjrt://` init method rename and doc updates (#5562)
515cf17a
will-cromar changed the title from "Cherry-pick commits from 9/27 to 8/9 into XRT branch" to "Cherry-pick 2.1 release branch into XRT branch through 9/14" 2 years ago
will-cromar Use new cache silo and skip test build
8416c066
will-cromar hack: disable missing test
da2dd820
will-cromar hack: alter cache silo name
586c0d01
will-cromar formatting
ebb3733a
will-cromar marked this pull request as ready for review 2 years ago
will-cromar requested a review from JackCaoG 2 years ago
will-cromar requested a review from mateuszlewko 2 years ago
will-cromar requested a review from stgpetrovic 2 years ago
JackCaoG approved these changes on 2023-09-15
will-cromar merged 7c32c0f4 into xrt 2 years ago
