transformers
parallelism goes brrr #37877 (Merged)
ArthurZucker merged 59 commits into main from nouamane/nanotron
3d90a99d accept custom device_mesh
df1eaee8 fix device_map
b9298864 assert that num_heads % tp_size == 0
1df751bc todo.
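The first commits wire in a user-supplied device mesh and guard head-count divisibility. A minimal sketch of that usage, assuming `from_pretrained` accepts the `tp_plan` and `device_mesh` arguments this PR introduces; the checkpoint name and mesh size are placeholders, and the script is meant to run under torchrun:

```python
# Minimal sketch: build a 1D tensor-parallel mesh and hand it to from_pretrained.
# Assumptions: `tp_plan`/`device_mesh` kwargs as introduced by this PR; launch
# with `torchrun --nproc-per-node 4 script.py`.
from torch.distributed.device_mesh import init_device_mesh
from transformers import AutoModelForCausalLM

tp_size = 4
mesh = init_device_mesh("cuda", (tp_size,), mesh_dim_names=("tp",))

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",   # placeholder checkpoint
    tp_plan="auto",              # shard weights following the model's TP plan
    device_mesh=mesh,            # custom mesh instead of an auto-created one
)

# Sharding is only valid when heads split evenly across TP ranks.
assert model.config.num_attention_heads % tp_size == 0
```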
5887ffc1 ReplicateParallel
924cceec handle tied weights
cfacec55 handle dtensor in save_pretrained with safe_serialization
98333058 tp test works
7d7b3636 doesnt work
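The safe_serialization commit has to deal with weights that are DTensors rather than plain tensors. A sketch of the underlying idea only, not the PR's actual save_pretrained hook, assuming a recent torch where DTensor lives in torch.distributed.tensor:

```python
# Sketch: gather DTensor shards before safetensors serialization.
# `.full_tensor()` all-gathers the shards so every rank holds the complete weight.
from torch.distributed.tensor import DTensor

def gather_state_dict_for_safetensors(state_dict):
    out = {}
    for name, tensor in state_dict.items():
        if isinstance(tensor, DTensor):
            tensor = tensor.full_tensor()
        out[name] = tensor.detach().cpu()
    return out
```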
ArthurZucker added the Tensor Parallel label
ArthurZucker added the Core: Modeling label
ArthurZucker commented on 2025-05-01
S1ro1 commented on 2025-05-01
11f02a59 fix shard_and_distribute_module's rank should be local_rank
317c0276 tp=4 is correct
f3b4ae81 dp+tp is broken
f6a49ee8 todo allreduce with dtensors on another dim is annoying
eaa65921 workaround to sync dp grads when using dtensors
7c6219bc loading a checkpoint works
6ceabe01 wandb and compare losses with different tp/dp
a9a15925 cleaning
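The "workaround to sync dp grads when using dtensors" commit points at the pattern sketched below: all-reduce gradients over only the data-parallel dimension of the mesh, leaving TP sharding untouched. Assumptions: a mesh with named ("dp", "tp") dims and a recent torch; this is not the PR's exact code.

```python
import torch.distributed as dist
from torch.distributed.tensor import DTensor

def sync_dp_grads(model, mesh):
    """Average gradients across the "dp" mesh dimension (sketch, not the PR code)."""
    dp_group = mesh["dp"].get_group()
    dp_size = dist.get_world_size(dp_group)
    for param in model.parameters():
        if param.grad is None:
            continue
        grad = param.grad
        # For DTensor grads, reduce the local shard only; TP placement stays untouched.
        local = grad.to_local() if isinstance(grad, DTensor) else grad
        dist.all_reduce(local, op=dist.ReduceOp.SUM, group=dp_group)
        local.div_(dp_size)
```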
NouamaneTazi requested a review from ArthurZucker 255 days ago
NouamaneTazi requested a review from S1ro1 255 days ago
NouamaneTazi marked this pull request as ready for review 255 days ago
4e323a51 cleaning
qubvel commented on 2025-05-02
7f327b13 .
c3e5c5ed .
810bd51a logs
82348732 CP2 DP2 no mask works after commenting attn_mask and is_causal from s…
29c2a9ca DP=2 TP=2 now works even with tied embeddings
8fa760be model.parameters() and model.module.parameters() are empty..
610e6bb0 reformat sanity_check_tensor_sync
75cad51d set atol=1e-4 for CP to pass
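The sanity_check_tensor_sync and atol commits amount to comparing a tensor across ranks within a tolerance (CP needed atol=1e-4 to pass). A sketch of that pattern, assuming the real helper's signature differs:

```python
import torch
import torch.distributed as dist

def sanity_check_tensor_sync(tensor, group, atol=1e-4):
    # Gather the tensor from every rank in `group` and check they agree within atol.
    world = dist.get_world_size(group)
    gathered = [torch.empty_like(tensor) for _ in range(world)]
    dist.all_gather(gathered, tensor.contiguous(), group=group)
    for other in gathered[1:]:
        torch.testing.assert_close(gathered[0], other, atol=atol, rtol=0)
```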
b816a3cc try populate _parameters from named_modules
688107c0 refactors
cfe688b4 is_causal=True and pack sequences, no attn mask, and preshuffle dataset
83095210 fix packing
c0f616ee CP=4 doesn't work
011d981e fix labels and position_ids for CP
265f90dc DP CP works with transformers 🥳🥳🥳
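The "fix labels and position_ids for CP" commit is about giving each context-parallel rank its slice of a packed sequence while keeping global positions and next-token labels consistent. A rough sketch under those assumptions (packed sequences, causal attention, no mask, contiguous per-rank chunks); the PR's actual slicing is likely more involved:

```python
import torch

def shard_for_cp(input_ids, cp_rank, cp_size):
    # input_ids: (batch, seq_len), packed sequences, causal attention, no mask.
    seq_len = input_ids.size(-1)
    assert seq_len % cp_size == 0
    chunk = seq_len // cp_size
    start, end = cp_rank * chunk, (cp_rank + 1) * chunk

    # Labels are the inputs shifted by one over the *full* sequence, then sliced.
    labels = torch.roll(input_ids, shifts=-1, dims=-1)
    labels[..., -1] = -100

    # position_ids must be global positions, not per-chunk offsets.
    position_ids = torch.arange(seq_len, device=input_ids.device).expand_as(input_ids)

    return input_ids[..., start:end], labels[..., start:end], position_ids[..., start:end]
```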
afa72e24 refactor
75176794 add example cp
835726da fixup
0ad2a156 revert sdpa changes
5b119645 example cleared
7855d102 add CP, DP to the mesh init
0b2bd157 nit
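With CP and DP added to the mesh init, the full layout is a 3D mesh. A sketch with illustrative dim names and sizes, meant to run under torchrun with dp*cp*tp processes:

```python
from torch.distributed.device_mesh import init_device_mesh

dp, cp, tp = 2, 2, 2
mesh = init_device_mesh("cuda", (dp, cp, tp), mesh_dim_names=("dp", "cp", "tp"))

dp_mesh = mesh["dp"]   # gradient averaging / FSDP
cp_mesh = mesh["cp"]   # splits the sequence dimension
tp_mesh = mesh["tp"]   # shards attention/MLP weights
```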
ArthurZucker commented on 2025-05-15
c82d39ce clean
957c351e use `ALL_PARALLEL_STYLES`
6d462e9f Merge branch 'nouamane/nanotron' of github.com:huggingface/transforme…
43c175d0 style
378b2e7b FSDP works
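The "FSDP works" commit layers data parallelism on top of the TP-sharded model. A rough sketch using torch's FSDP over a "dp" sub-mesh; the PR's Trainer integration wires this up differently and the module below is only a stand-in:

```python
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

mesh = init_device_mesh("cuda", (2, 2), mesh_dim_names=("dp", "tp"))
model = nn.Linear(1024, 1024).cuda()      # stand-in for the real TP-sharded model
model = FSDP(model, device_mesh=mesh["dp"], use_orig_params=True)
```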
30752c63 log on 1 rank
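Logging on a single rank keeps multi-process runs readable. A small sketch of the pattern, assuming a plain torch.distributed rank check rather than whatever gate the PR actually uses:

```python
import logging
import torch.distributed as dist

logger = logging.getLogger(__name__)

def log_rank0(msg):
    # Only rank 0 emits the message; single-process runs log normally.
    if not dist.is_initialized() or dist.get_rank() == 0:
        logger.info(msg)
```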
9c1e1fc2 .
3f683b6e fix?
d36acced Merge branch 'nouamane/nanotron' of github.com:huggingface/transforme…
780d74d3 FSDP1 also has .parameters() bug
9e549694 reported gradnorm when using FSDP1 is wrong, but loss is correct so i…
ba01287a .
677ce533 style and fixup
81c21de9 move stuff around
656277c5 Merge branch 'main' of github.com:huggingface/transformers into nouam…
e27ddb85 fix tests
d702d94d style
5083c0b0 let's make it a check
67a81826 warning should be an info
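The `ALL_PARALLEL_STYLES` and "let's make it a check" commits suggest validating each tp_plan entry against the supported parallel styles. A hypothetical sketch of that check; the function name and style set below are illustrative, not the library's actual API:

```python
# Illustrative only: SUPPORTED_STYLES and validate_tp_plan are not transformers APIs.
SUPPORTED_STYLES = {"colwise", "rowwise", "colwise_rep", "rowwise_rep"}

def validate_tp_plan(tp_plan):
    for pattern, style in tp_plan.items():
        if style not in SUPPORTED_STYLES:
            raise ValueError(
                f"Unsupported parallel style {style!r} for {pattern!r}; "
                f"expected one of {sorted(SUPPORTED_STYLES)}"
            )

validate_tp_plan({"model.layers.*.self_attn.q_proj": "colwise"})
```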
ArthurZucker enabled auto-merge (squash) 237 days ago
ArthurZucker disabled auto-merge 237 days ago (manually disabled by user)
ArthurZucker merged 1c2f36b4 into main 237 days ago
ArthurZucker deleted the nouamane/nanotron branch 237 days ago
LysandreJik restored the head branch 237 days ago
Reviewers: ArthurZucker, S1ro1, qubvel
Assignees: No one assigned
Labels: Core: Modeling, Tensor Parallel
Milestone: No milestone