SemanticDiff pytorch
2166fc55 - improve softmax lastdim performance on bfloat16 by adding more fusion
