Add stash_type attribute to SkipLayerNormalization ops

Add stash_type attribute (default=1, float32) to both
SkipLayerNormalization and SkipSimplifiedLayerNormalization schemas,
matching the existing attribute on standard LayerNormalization.

When stash_type=1 and the input type is f16/bf16, the CUDA kernel
routes to HostApplyLayerNorm, which accumulates variance in float32.
This prevents f16 overflow in deep networks where residual values
grow large through skip connections (e.g. Qwen3-VL 28-layer with
QK-norm produces residuals with absmax > 100 by layer 22).

The fix addresses the root cause of f16 NaN in decoder models:
SkipSimplifiedLayerNormalization accumulated x²/ld in f16, and the
accumulated sum (the mean of x²) overflows once it exceeds the f16
maximum of 65504, i.e. when the RMS of x is above ~256.
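
The overflow can be reproduced outside CUDA with a minimal sketch. It
assumes a sequential accumulation loop (the real kernel uses a block
reduction, so its rounding differs) and emulates f16/f32 rounding with
Python's struct module. The helper names and the residual magnitude of
300 are illustrative, not taken from the kernel or from measurements.

```python
import math
import struct


def to_f16(value: float) -> float:
    """Round to the nearest IEEE 754 binary16 value; out of range becomes +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", value))[0]
    except OverflowError:
        return math.copysign(math.inf, value)


def to_f32(value: float) -> float:
    """Round to the nearest IEEE 754 binary32 value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def mean_square(values, round_fn):
    """Accumulate sum(x * x / ld), rounding every intermediate with round_fn."""
    rld = round_fn(1.0 / len(values))
    acc = 0.0
    for v in values:
        x = to_f16(v)  # the inputs are f16 in both variants
        term = round_fn(round_fn(rld * x) * x)
        acc = round_fn(acc + term)
    return acc


hidden_size = 2048
residual = [300.0] * hidden_size  # illustrative magnitude (assumption)

ms_f16 = mean_square(residual, to_f16)  # f16 accumulator
ms_f32 = mean_square(residual, to_f32)  # float32 accumulator

print(ms_f16, ms_f32)
```

With these inputs the f16 accumulator ends at inf, while the float32
accumulator holds the finite mean square 90000.0.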

Changes:
- bert_defs.cc: Add stash_type attribute to both SkipLayerNorm
  and SkipSimplifiedLayerNorm schemas
- skip_layer_norm.h: Add use_float_accumulation_ member
- skip_layer_norm.cc: Read stash_type, use HostApplyLayerNorm
  (float accumulation) when stash_type=1 for f16/bf16
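
The dispatch rule described above can be sketched as follows. This is
an illustration of the condition only, not the code in
skip_layer_norm.cc: the helper name and the dtype strings are
hypothetical, and the fallback for other stash_type values (here 10,
the ONNX TensorProto code for FLOAT16) is assumed to be the existing
path.

```python
def use_float_accumulation(stash_type: int, input_dtype: str) -> bool:
    """Return True when the float32-accumulating path should be taken.

    stash_type == 1 is the ONNX TensorProto code for float32. The dtype
    strings are hypothetical labels used only for this sketch.
    """
    return stash_type == 1 and input_dtype in ("float16", "bfloat16")


# Default attribute value with half-precision inputs: float accumulation.
print(use_float_accumulation(1, "float16"))
# Assumed fallback: any other stash_type keeps the existing path.
print(use_float_accumulation(10, "float16"))
```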

Tested: Qwen3-VL-2B f16 CUDA decoder produces correct output at
all tested input magnitudes (std=0.01 to 0.5) with zero NaN.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>