Add MEA+decode support in ONNX Attention LLM op
Enable Memory Efficient Attention (cutlass FMHA) to handle decode
steps with past_key/past_value; these were previously routed to the
Flash Attention kernel only.
Changes:
- Add LaunchConcatNewToPastKV before MEA dispatch to concatenate
past_key+K into present_key (and past_value+V into present_value)
following the same pattern as the Flash decode path
- Remove past_key==nullptr eligibility check from mea_eligible
- Track kv_is_bsnh separately from is_bsnh since present buffers are
always BNSH after concat; pass kv_is_bsnh to LaunchUngroup and MEA
params for correct stride computation
- Set present_kv_already_populated=true after concat to skip redundant
post-attention present_key/value copy
- Enforce head_size==v_head_size for MEA decode (LaunchConcatNewToPastKV
uses a single head_size parameter)
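A host-side sketch of the concat semantics (the real LaunchConcatNewToPastKV is a CUDA kernel; the function name, signature, and float element type here are illustrative assumptions): past (B, N, S_past, H) and the new K or V (B, N, S_new, H) are copied back-to-back along the sequence dimension into present (B, N, S_past + S_new, H), all in BNSH layout.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative CPU analogue of the concat step: for each (batch, head)
// pair, write the past slice followed by the new slice into the present
// buffer. All tensors are BNSH, so each (b, n) slice is contiguous.
std::vector<float> ConcatNewToPast(const std::vector<float>& past,
                                   const std::vector<float>& added,  // new K or V
                                   std::size_t B, std::size_t N,
                                   std::size_t S_past, std::size_t S_new,
                                   std::size_t H) {
  const std::size_t S_total = S_past + S_new;
  std::vector<float> present(B * N * S_total * H);
  for (std::size_t b = 0; b < B; ++b) {
    for (std::size_t n = 0; n < N; ++n) {
      float* dst = present.data() + ((b * N + n) * S_total) * H;
      const float* src_past = past.data() + ((b * N + n) * S_past) * H;
      const float* src_new = added.data() + ((b * N + n) * S_new) * H;
      std::copy(src_past, src_past + S_past * H, dst);            // past first
      std::copy(src_new, src_new + S_new * H, dst + S_past * H);  // then new tokens
    }
  }
  return present;
}
```

Because the single head_size H is used for both the K and V copies, this shape of helper is also why the change enforces head_size == v_head_size on the MEA decode path.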
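The kv_is_bsnh split matters because the stride pattern differs between the two layouts. A minimal sketch (not the actual ORT stride code) of the element offset of coordinate (b, s, n, h) under each layout, which is the computation the separate flag keeps correct for the always-BNSH present buffers:

```cpp
#include <cstddef>

// Offset of element (b, s, n, h) in a flat buffer, for seq length S,
// num_heads N, head size H. BSNH strides over seq before heads; BNSH
// strides over heads before seq.
std::size_t Offset(bool is_bsnh, std::size_t b, std::size_t s,
                   std::size_t n, std::size_t h,
                   std::size_t S, std::size_t N, std::size_t H) {
  return is_bsnh
             ? ((b * S + s) * N + n) * H + h   // BSNH: [B, S, N, H]
             : ((b * N + n) * S + s) * H + h;  // BNSH: [B, N, S, H]
}
```

Using one flag for both Q and K/V would compute K/V offsets with BSNH strides after the concat has produced BNSH buffers, which is the bug the separate kv_is_bsnh avoids.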
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Agent-signed-off: Developer (16a065d8) [claude-opus-4.6]