transformers
c5c69096 - Exclude the load balancing loss of padding tokens in Mixtral-8x7B (#28517)

* fix the function load_balancing_loss_func in Mixtral_Moe to include attention_mask
* format code using black and ruff
* skip computing mask if attention_mask=None
* add tests for load balancing loss Mixtral-Moe
* fix assert loss is different in mixtral_test
* fix pad_leng
* use assertNotAlmostEqual and print to debug
* remove print for debug
* minor updates
* reduce rtol and atol
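The idea behind the fix can be sketched as follows. The MoE auxiliary (load balancing) loss multiplies, per expert, the fraction of tokens routed to it by the mean router probability it receives; averaging over *all* positions lets padding tokens skew both terms. Passing the attention mask and renormalizing over real tokens only removes that bias. This is a minimal NumPy sketch with a top-1 router, not the actual Hugging Face `load_balancing_loss_func` (which handles top-k routing and batched 4-D masks); the function name and shapes here are illustrative assumptions.

```python
import numpy as np

def load_balancing_loss(router_probs, expert_indices, attention_mask=None, num_experts=2):
    """Illustrative top-1 MoE load balancing loss with optional padding mask.

    router_probs:    (num_tokens, num_experts) softmax outputs of the router
    expert_indices:  (num_tokens,) chosen expert per token (top-1 for simplicity)
    attention_mask:  (num_tokens,) 1 for real tokens, 0 for padding, or None
    """
    one_hot = np.eye(num_experts)[expert_indices]          # (num_tokens, num_experts)
    if attention_mask is None:
        # Original behavior: average over every position, padding included.
        tokens_per_expert = one_hot.mean(axis=0)
        prob_per_expert = router_probs.mean(axis=0)
    else:
        # Fixed behavior: zero out padding positions and renormalize
        # by the number of real tokens only.
        mask = attention_mask.astype(float)[:, None]       # (num_tokens, 1)
        denom = mask.sum()
        tokens_per_expert = (one_hot * mask).sum(axis=0) / denom
        prob_per_expert = (router_probs * mask).sum(axis=0) / denom
    return num_experts * np.sum(tokens_per_expert * prob_per_expert)
```

With perfectly uniform routing the loss is 1.0 (its minimum); if padding tokens all collapse onto one expert, the unmasked loss rises even though the real tokens are balanced, while the masked version stays at 1.0.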