[SPMD] Support reduce-scatter in manual sharding #7231

alanwaketan · 2024-06-10T20:48:57Z

Summary:
This PR adds experimental support for collective communication (cc) ops in manual sharding zones, starting with reduce-scatter. The key change is to add channel_id, replica_groups, and use_global_device_ids in the lowering.
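For readers unfamiliar with the op, here is a minimal pure-Python sketch of what a sum reduce-scatter computes and how replica_groups partitions the participating devices. The helper `reduce_scatter_sum` and the device-id dictionary are hypothetical and exist only for illustration; this is not the torch_xla binding or the XLA lowering, and channel_id and use_global_device_ids (which govern how XLA interprets the ids) are not modeled.

```python
from typing import Dict, List, Sequence


def reduce_scatter_sum(
    shards: Dict[int, List[int]],
    replica_groups: Sequence[Sequence[int]],
) -> Dict[int, List[int]]:
    """Simulate a sum reduce-scatter over 1-D per-device inputs.

    Hypothetical helper for illustration only; not a torch_xla API.
    """
    out: Dict[int, List[int]] = {}
    for group in replica_groups:
        n = len(group)
        length = len(shards[group[0]])
        if length % n != 0:
            raise ValueError("input length must be divisible by group size")
        # Reduce: element-wise sum of every participant's input.
        reduced = [sum(shards[d][i] for d in group) for i in range(length)]
        # Scatter: the k-th device listed in the group keeps the k-th chunk.
        chunk = length // n
        for k, device in enumerate(group):
            out[device] = reduced[k * chunk:(k + 1) * chunk]
    return out


# Device d holds [10*d, 10*d + 1, 10*d + 2, 10*d + 3].
inputs = {d: [10 * d + i for i in range(4)] for d in range(4)}

# One group spanning all four devices: each device keeps one element.
all_devices = reduce_scatter_sum(inputs, [[0, 1, 2, 3]])

# Two independent groups of two devices: each device keeps two elements.
pairs = reduce_scatter_sum(inputs, [[0, 1], [2, 3]])
```

With a single group, every device contributes to every output element; with two groups, devices 0 and 1 never see data from devices 2 and 3. That grouping is the part of the semantics that replica_groups expresses.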

Test Plan:
PJRT_DEVICE=TPU XLA_USE_SPMD=1 python test/spmd/test_xla_sharding.py -v -k test_spmd_reduce_scatter

alanwaketan · 2024-06-10T22:30:24Z

Why the hell do we run SPMD tests on CPU?

JackCaoG · 2024-06-10T22:32:54Z

You can skip your test on CPU. Most of the sharding-related tests actually pass on CPU. The CPU test is the only CI that upstream runs against us.
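A minimal sketch of skipping a test on non-TPU devices with the standard `unittest.skipIf` decorator. It assumes the device is selected through the `PJRT_DEVICE` environment variable, as in the test plan command; the class name `ReduceScatterTest` and the `ON_TPU` flag are hypothetical, and the actual test in this PR may detect the device differently.

```python
import os
import unittest

# Assumption: the target device is chosen via the PJRT_DEVICE environment
# variable, as in the test plan command above.
ON_TPU = os.environ.get("PJRT_DEVICE", "") == "TPU"


class ReduceScatterTest(unittest.TestCase):  # hypothetical test class

    @unittest.skipIf(not ON_TPU, "reduce-scatter in manual sharding needs TPU")
    def test_spmd_reduce_scatter(self):
        # Placeholder body; the real test would build the manual sharding
        # zone and compare the reduce-scatter result against a reference.
        self.assertTrue(ON_TPU)
```

On a CPU-only CI run the test is reported as skipped rather than failed, so the upstream CPU job stays green.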

JackCaoG · 2024-06-10T22:33:56Z

torch_xla/csrc/init_python_bindings.cpp

          return result_tuple;
        });
+  m.def(
+      "_xla_spmd_reduce_scatter",


So the only difference is that this one does not have a token?

alanwaketan · 2024-06-11

That's one difference. The others are mentioned in the description.

alanwaketan · 2024-06-11T06:56:46Z

The GPU test failure doesn't seem to be related.

alanwaketan added 7 commits June 6, 2024 06:29

initial commit

12b5a7e

Add a test case

ad5b7a6

fix the lowering

38b28e7

add canonical index

6f4f170

Fix test

bccdacd

Add more test

aa68e60

Fix linters

a212189

alanwaketan added the tpuci label Jun 10, 2024

alanwaketan requested review from JackCaoG and jonb377 June 10, 2024 20:48

alanwaketan self-assigned this Jun 10, 2024

Fix linters

fec3860

JackCaoG reviewed Jun 10, 2024

Skip non TPU devices

48eef41

JackCaoG approved these changes Jun 11, 2024

alanwaketan merged commit 70d2e9e into master Jun 11, 2024

alanwaketan added the backport_2.4 label Jul 17, 2024

alanwaketan mentioned this pull request Jul 17, 2024

2.4 backport PR request list #7242

Closed
