onnxruntime
ce40b70f - Add stash_type attribute to SkipLayerNormalization ops

Commit

68 days ago

Add stash_type attribute to SkipLayerNormalization ops

Add stash_type attribute (default=1, float32) to both SkipLayerNormalization and SkipSimplifiedLayerNormalization schemas, matching the existing attribute on standard LayerNormalization.

When stash_type=1 and the input type is f16/bf16, the CUDA kernel routes to HostApplyLayerNorm which accumulates variance in float32. This prevents f16 overflow in deep networks where residual values grow large through skip connections (e.g. Qwen3-VL 28-layer with QK-norm produces residuals with absmax > 100 by layer 22).

The fix addresses the root cause of f16 NaN in decoder models: SkipSimplifiedLayerNormalization accumulated x²/ld in f16, which overflows when x > ~180 for hidden_size=2048 (x²/2048 > 65504).

Changes:
- bert_defs.cc: Add stash_type attribute to both SkipLayerNorm and SkipSimplifiedLayerNorm schemas
- skip_layer_norm.h: Add use_float_accumulation_ member
- skip_layer_norm.cc: Read stash_type, use HostApplyLayerNorm (float accumulation) when stash_type=1 for f16/bf16

Tested: Qwen3-VL-2B f16 CUDA decoder produces correct output at all input magnitudes (std=0.01 to 0.5) with zero NaN.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
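The overflow described above can be illustrated on the CPU with a rough sketch. It is not the CUDA kernel and does not reproduce its order of operations, so the exact magnitude at which overflow starts will differ from the real code. The helper names (`to_f16`, `to_f32`, `mean_square`) and the test value of 300 are assumptions made for illustration; rounding is emulated with the `e` (half) and `f` (single) formats of Python's `struct` module.

```python
import math
import struct


def _round_to(fmt: str, x: float) -> float:
    """Round x to the float format named by fmt; overflow becomes +/-inf."""
    if math.isnan(x) or math.isinf(x):
        return x
    try:
        return struct.unpack(fmt, struct.pack(fmt, x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the format's finite range.
        return math.copysign(math.inf, x)


def to_f16(x: float) -> float:
    return _round_to("<e", x)  # IEEE 754 half precision, max finite 65504


def to_f32(x: float) -> float:
    return _round_to("<f", x)  # IEEE 754 single precision


def mean_square(values, rnd):
    """Accumulate sum(x*x / n), rounding every intermediate with rnd."""
    n = len(values)
    acc = 0.0
    for v in values:
        x = to_f16(v)               # inputs are stored as f16 in both cases
        term = rnd(rnd(x * x) / n)  # square, then scale by 1/n
        acc = rnd(acc + term)
    return acc


hidden_size = 2048
residual = [300.0] * hidden_size    # hypothetical large residual values

ms_f16 = mean_square(residual, to_f16)  # every intermediate kept in f16
ms_f32 = mean_square(residual, to_f32)  # float32 accumulation
print(ms_f16, ms_f32)
```

With these inputs the square of each element already exceeds the largest finite f16 value, so the f16 accumulator becomes infinite, while the float32 accumulator stays finite. That is the failure mode float32 accumulation is meant to avoid.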

References

fix-skip-layer-norm-f16-overflow

#28442 - Fix f16 overflow in SkipLayerNormalization CUDA kernel

Author

justinchuby

Parents

3cc4cef0
