xla
Add `_sharded_cpu_state_dict` for distributed checkpointing
#5288

Merged

jonb377 merged 46 commits into pytorch:master from yashjs_sharded_cpu_state_dict

initial commit

1185c3ea

Add test workflow for `xrt` branch (#5241)

aeb90bcf

Add function to generate stablehlo based callable from pytorch model …

e7c49219

Only run the main CI workflow on PRs targeting master and release bra…

141514b2

AMP for TPUs v3 (#5161)

4a142e0b

remove duplicate autocast_test (#5246)

c8052632

Remove `test_experimental_pjrt_tpu.py` from TPU CI (#5247)

2b6f2849

Install `expecttest` in xla_test_job.yaml (#5252)

c4611a1e

Add IAM roles for cloudbuild_editors (#5251)

b0fbb485

[Functionalization] Remove view in view_symint (#5231)

2606a309

Delete XRT from the main branch (#5240)

1901688c

Add nightly build for cuda 12 (#5253)

37ac0495

Fix the linter command in the CI (#5254)

6754db49

Jack cao g/fix spmd buff is null (#5256)

ec471f5d

Skip calling as_strided in empty_strided_symint if the input has dyna…

a2f8a93d

Add XRT nightly builds (#5261)

db7f8ee5

[OpenXLA] Migrate to pull XLA from OpenXLA (#5202)

bf759cfe

Add ToString method for both PjrtData and PjrtShardedData (#5265)

8aa92dd1

Update Sharded graph HLO dumping (#5266)

cf3bef8c

Enable PjRt Client Compilation with StableHLO (#5233)

00191db6

Disable Bazel remote cache for forked PR (#5259)

31fbc332

Suppress debug symbols in OpenXLA code (#5269)

3c0450a6

[SPMD] Sharding n-d tensor on (n+1)-d Mesh (#5268)

82a8041a

Make TPU detection more robust (#5271)

0d37af46

Clean bazel stuff on distutils clean. (#5274)

b0a70d3d

Delete unused .so file, and .lds files (#5275)

03d4f70e

Fix the error when export_torch_model is given a non-tensor (#5277)

15e32b25

Disable test_simple_model_with_different_input_shape since it is curr…

42a41a1e

Always do build_ext in python setup.py develop (#5273)

60217dba

Remove or improve several hardcoded TPU test conditions (#5272)

4af36bac

Add `runtime.host_index` (#5283)

a6f72731

Make it an error if calling sizes() on a dynamic tensor. (#4998)

fa6ff04f

Fix the error where mark_step does not materialize tensors on SPMD:0 (…

97284611

Disable torch._dynamo.config.automatic_dynamic_shapes (#5285)

8c13a267

Merge branch 'master' of https://github.com/pytorch/xla into yashjs_s…

9e745815

run linter

aed264fd

jonb377 commented on 2023-07-07

wrap only if sharding type is non-replicated

2797df3d

shahyash10 requested a review from jonb377 2 years ago

jonb377 commented on 2023-07-10

Merge branch 'master' of https://github.com/pytorch/xla into yashjs_s…

34d7f9e5

Handle non-tensors

0842686c

run linter

97f697f7

shahyash10 requested a review from jonb377 2 years ago

jonb377 commented on 2023-07-10

Call wrap_if_sharded first

1c78d8e3

shahyash10 requested a review from jonb377 2 years ago

Add exception in test for unsharded tensor

1faeac36

Merge branch 'master' of https://github.com/pytorch/xla into yashjs_s…

7c614a2a

fix test

4e4d04a5

Use torch.Tensor instead of torch.tensor

a82e8ea2

jonb377 commented on 2023-07-12

use .cpu() only for tensors

89ae5684

shahyash10 requested a review from jonb377 2 years ago

jonb377 approved these changes on 2023-07-13

jonb377 merged 46a0117b into master 2 years ago

shahyash10 deleted the yashjs_sharded_cpu_state_dict branch 2 years ago
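
For readers skimming the timeline, the review-driven commits above ("wrap only if sharding type is non-replicated", "Handle non-tensors", "Call wrap_if_sharded first", "use .cpu() only for tensors") suggest a per-entry flow roughly like the sketch below. This is an illustration inferred from the commit titles only, not the code merged in this PR. `FakeTensor`, `FakeShardedWrapper`, `local_shard_on_cpu`, and `sharded_cpu_state_dict_sketch` are hypothetical stand-ins written in plain Python so the example runs without torch or torch_xla, and the local `wrap_if_sharded` here is a toy function that only borrows the name from the commit title.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a device tensor (not a torch/torch_xla type)."""
    data: List[int]
    device: str = "xla"
    sharded: bool = False  # True when the sharding type is non-replicated

    def cpu(self) -> "FakeTensor":
        # Return a copy of the whole tensor placed on the CPU.
        return FakeTensor(list(self.data), device="cpu", sharded=False)


@dataclass
class FakeShardedWrapper:
    """Hypothetical wrapper that exposes only the shard local to this host."""
    tensor: FakeTensor

    def local_shard_on_cpu(self) -> FakeTensor:
        # Assumption for illustration: the first half of the data is local.
        half = len(self.tensor.data) // 2
        return FakeTensor(self.tensor.data[:half], device="cpu")


def wrap_if_sharded(value: Any) -> Any:
    """Toy version: wrap only tensors whose sharding is non-replicated."""
    if isinstance(value, FakeTensor) and value.sharded:
        return FakeShardedWrapper(value)
    return value


def sharded_cpu_state_dict_sketch(state_dict: Dict[str, Any]) -> Dict[str, Any]:
    """Build a CPU-side state dict holding local shards for sharded entries."""
    out: Dict[str, Any] = {}
    for name, value in state_dict.items():
        value = wrap_if_sharded(value)  # wrap first, then dispatch on type
        if isinstance(value, FakeShardedWrapper):
            out[name] = value.local_shard_on_cpu()
        elif isinstance(value, FakeTensor):
            out[name] = value.cpu()  # .cpu() is called only on tensors
        else:
            out[name] = value  # non-tensors (ints, strings, ...) pass through
    return out


example = sharded_cpu_state_dict_sketch({
    "weight": FakeTensor([1, 2, 3, 4], sharded=True),
    "bias": FakeTensor([5, 6]),
    "step": 7,
})
```

In this toy run, `example["weight"]` holds only the local half of the data on CPU, `example["bias"]` is a full CPU copy, and `example["step"]` is returned unchanged. The point of keeping only local shards is that each host can then write its own portion of a distributed checkpoint without gathering the full tensor.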

Reviewers

jonb377

