[Attention] Add TOKENSPEED_MLA backend for DeepSeek R1 prefill + decode on Blackwell

Wires the tokenspeed_mla CuTe DSL kernels into vLLM as a new MLA backend,
covering both prefill (tokenspeed_mla_prefill) and decode
(tokenspeed_mla_decode). Targets Blackwell (SM100) with FP8 KV cache and
DeepSeek R1 MLA dimensions; users opt in via -ac
'{"backend":"TOKENSPEED_MLA","mla_prefill_backend":"TOKENSPEED_MLA"}'.
Includes numeric parity tests against the trtllm reference kernels.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>