Megatron-DeepSpeed
adding scalenorm, attention_init_method and relu^2
#139
Open
