Add `_sharded_cpu_state_dict` for distributed checkpointing (#5288)
* initiak commit
* Add test workflow for `xrt` branch (#5241)
* Add test workflow for `xrt` branch
* Only run for PRs targeting XRT branch
* Add function to generate stablehlo based callable from pytorch model (#5216)
* Add function to generate stablehlo based callable from pytorch model
Added function
`torch_xla.experimental.stablehlo_saved_model.export_pytorch_model`.
This function will take a pytorch Module and convert it into stablehlo
bytecode.
* Only run the main CI workflow on PRs targeting master and release branches (#5244)
* Only run main CI for master and release branches.
* Disabling XRT tests on main CI
* AMP for TPUs v3 (#5161)
* remove duplicate autocast_test (#5246)
* Remove `test_experimental_pjrt_tpu.py` from TPU CI (#5247)
* Install `expecttest` in xla_test_job.yaml (#5252)
* Add IAM roles for cloudbuild_editors (#5251)
* [Functionalization] Remove view in view_symint (#5231)
* [Functionalization] Remove view in view_symint
Summary:
This pull request removes views in tensor_method::view_symint.
Test Plan:
XLA_DISABLE_FUNCTIONALIZATION=1 PJRT_DEVICE=TPU python ../test/test_view_ops.py -v -k TestViewOpsXLA.test_view_view
PJRT_DEVICE=TPU python ../test/test_view_ops.py -v -k TestViewOpsXLA.test_view_view
* Fix linters
* fixed the test
* ran the linter
---------
Co-authored-by: Xiongfei Wei <isaacwxf23@gmail.com>
* Delete XRT from the main branch (#5240)
* Delete XRT from the main branch
* Remove dead import
* formatting
* Remove disable_xrt build option
* Fix runtime init
* Revert "Remove disable_xrt build option"
This reverts commit ba312e76e069bef40c8f9803a672b29409862804.
* Add disable XRT option back
* formatting
* Prune mesh service
* Remove obsolete test
* Remove other run server script
* Remove XRT config
* Update PJRT default device test
* Add a file I forgot to save
* if using_pjrt -> @requires_pjrt
* Remove irrelevant test case
* Remove XRT env vars
* fix md link
* formatting
* Remove extra `requires_pjrt`
* merge conflicts
* Add other autocast back
* Add nightly build for cuda 12 (#5253)
* Fix the linter command in the CI (#5254)
* fix linter command
* ran linter
* Jack cao g/fix spmd buff is null (#5256)
* Fix that non-tensor scalar can't be handled by virtual device
* add test
* comment
* Skip calling as_strided in empty_strided_symint if the input has dynamic dimensions. (#5239)
* Skip calling as_strided in empty_strided_symint.
* only return empty_symint conditionally.
* add a comment
* Add XRT nightly builds (#5261)
* Add XRT nightly builds
* remove space
* [OpenXLA] Migrate to pull XLA from OpenXLA (#5202)
PyTorch/XLA migrate to pull XLA from OpenXLA by replacing TensorFlow with OpenXLA after deprecating XRT usage, and replace TensorFlow-pin with OpenXLA-pin to May09
* Add ToString method for both PjrtData and PjrtShardedData (#5265)
* Add ToString method for both PjrtData and PjrtShardedData
* on cpu same config will become replicated, dont't check actual op sharding type
* Update Sharded graph HLO dumping (#5266)
* Enable PjRt Client Compilation with StableHLO (#5233)
* Enable xla PjRt client compilation with StableHLO
* add XLA_STABLEHLO_COMPILE to configuration.yaml
* fix merge conflict
* dummy commit to trigger ci
* Revert "dummy commit to trigger ci"
This reverts commit f7aec233d18637e242427c4542b12cf65c431ebc.
* Disable Bazel remote cache for forked PR (#5259)
* disable bazel remote cache if gcloud key is empty
* remove remote cache from setup.py
* experiment with debug msg
* fix flag
* add more logs
* skip remote chache if credential file is empty
* add comment
* add logs
* add check in test and coverage script
* fix condition in coverage test
* advance branch pr
* allow remote cache if gloud file isn't specified explicitly
* remove dummy comment
* Suppress debug symbols in OpenXLA code (#5269)
* [SPMD] Sharding n-d tensor on (n+1)-d Mesh (#5268)
* Make TPU detection more robust (#5271)
* Clean bazel stuff on distutils clean. (#5274)
* Clean bazel stuff on distutils clean
* Fix python formatting
* Delete unused .so file, and .lds files (#5275)
* [OpenXLA] Delete unused .so file and .lds files
* Fix the error when export_torch_model is given a non-tensor (#5277)
However the generated StableHLO graph still hardcodes the
non-tensor value. this is not correct, will fix later.
* Dsiable test_simple_model_with_different_input_shape since it is curretnly broken by pytorch (#5282)
* Always do build_ext in python setup.py develop (#5273)
Bazel should figure out that _XLAC.so is current
or not, and trigger rebuild if any cpp files changed.
* Remove or improve several hardcoded TPU test conditions (#5272)
* Remove or improve several hardcoded TPU test conditions
* Fix test condition
* Add `runtime.host_index` (#5283)
* Make it an error if calling sizes() on a dynamic tensor. (#4998)
* Err if calling sizes() on dynamic tensor
* try to set has_symbolic_sizes_strides_
* resolve merge conflict
* enable CONTINUE_ON_ERROR
* fixed the python test test_SizeEq_should_not_compile_for_identical_symints
* fix test_index_types
* set CONTINUE_ON_ERROR to true
* remove some unwanted code.
* add a print
* directly set has_symbolic_sizes_strides_ = true
* make some fixes.
* fix empty_strided_symint
* ran linter
* change error type in the test.
* fix comments
* ran linter
* Fix the error where mark_step does not materalize tensors on SPMD:0 (#5281)
* Fix the error where mark_step does not materalize tensors on SPMD:0
* typo
* fix test_non_tensor_scalar
* Disable torch._dynamo.config.automatic_dynamic_shapes (#5285)
* Set torch._dynamo.config.automatic_dynamic_shapes to False
* Enable DynamoInferenceBasicTest.test_simple_model_with_different_input_shape
* run linter
* wrap only if sharding type is non-replicated
* Handle non-tensors
* run linter
* Call wrap_if_sharded first
* Add exception in test for unsharded tensor
* fix test
* Use torch.Tensor instead of torch.tensor
* use .cpu() only for tensors
---------
Co-authored-by: Will Cromar <wcromar@google.com>
Co-authored-by: qihqi <hanq@google.com>
Co-authored-by: Meghan Cowan <cowanmeg@google.com>
Co-authored-by: Mateusz Lewko <mateusz.lewko@gmail.com>
Co-authored-by: Jiewen Tan <jwtan@google.com>
Co-authored-by: Xiongfei Wei <isaacwxf23@gmail.com>
Co-authored-by: Wonjoo Lee <wonjoo@google.com>
Co-authored-by: JackCaoG <59073027+JackCaoG@users.noreply.github.com>
Co-authored-by: Manfei <41607353+ManfeiBai@users.noreply.github.com>
Co-authored-by: Siyuan Liu <lsiyuan@google.com>
Co-authored-by: stgpetrovic <stgpetrovic@gmail.com>
Co-authored-by: Mohit Khatwani <118776932+khatwanimohit@users.noreply.github.com>