Exclude the load balancing loss of padding tokens in Mixtral-8x7B (#28517)
* fix load_balancing_loss_func in the Mixtral MoE model to take attention_mask into account (see the sketch after this list)
* format code using black and ruff
* skip computing the padding mask when attention_mask is None
* add tests for the Mixtral MoE load balancing loss (test sketch below)
* fix the assertion that the losses differ in the Mixtral test
* fix pad_leng in the test
* use assertNotAlmostEqual and print statements to debug
* remove the debug print statements
* minor updates
* reduce rtol and atol
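
A minimal sketch of the idea behind the fix, assuming a simplified signature: the auxiliary load-balancing loss averages router statistics over tokens, so padding tokens should be excluded from those averages when an attention_mask is available. The function name, arguments, and shapes below are illustrative, not the exact transformers implementation.

```python
from typing import Optional

import torch


def load_balancing_loss(router_logits: torch.Tensor,
                        num_experts: int,
                        top_k: int = 2,
                        attention_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
    """router_logits: (batch * seq_len, num_experts); attention_mask: (batch, seq_len)."""
    routing_weights = torch.softmax(router_logits, dim=-1)
    _, selected_experts = torch.topk(routing_weights, top_k, dim=-1)
    # One-hot expert assignments per token: (batch * seq_len, top_k, num_experts)
    expert_mask = torch.nn.functional.one_hot(selected_experts, num_experts).float()

    if attention_mask is None:
        # No padding information: average over all tokens (previous behaviour).
        tokens_per_expert = expert_mask.mean(dim=0)           # (top_k, num_experts)
        router_prob_per_expert = routing_weights.mean(dim=0)  # (num_experts,)
    else:
        # Flatten the mask to align with the (batch * seq_len) token dimension
        # and exclude padding tokens from both averages.
        mask = attention_mask.reshape(-1).float()             # (batch * seq_len,)
        denom = mask.sum()
        tokens_per_expert = (expert_mask * mask[:, None, None]).sum(dim=0) / denom
        router_prob_per_expert = (routing_weights * mask[:, None]).sum(dim=0) / denom

    overall_loss = torch.sum(tokens_per_expert * router_prob_per_expert.unsqueeze(0))
    return overall_loss * num_experts
```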
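
The added test checks the same property the commits describe: on a padded batch, the loss computed with the attention_mask should differ from the loss that treats padding as real tokens. The test below is a hypothetical sketch reusing the load_balancing_loss function from the sketch above, not the actual test in the PR.

```python
import torch


def test_load_balancing_loss_ignores_padding():
    torch.manual_seed(0)
    batch, seq_len, num_experts, pad_len = 2, 8, 8, 3
    router_logits = torch.randn(batch * seq_len, num_experts)

    # Mark the last pad_len positions of every sequence as padding.
    attention_mask = torch.ones(batch, seq_len, dtype=torch.long)
    attention_mask[:, -pad_len:] = 0

    loss_no_mask = load_balancing_loss(router_logits, num_experts, top_k=2)
    loss_with_mask = load_balancing_loss(router_logits, num_experts, top_k=2,
                                         attention_mask=attention_mask)

    # With random logits the two averages should almost never coincide.
    assert not torch.isclose(loss_no_mask, loss_with_mask, rtol=1e-4, atol=1e-4)
```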