test(ark): prefill perf at 2K/4K seq len with warmup + averaging

Update test/test_ark/test_moe_model_perf.py:
- Drop the single 128-token prefill prompt; benchmark prefill at seq_len
2048 and 4096 instead, and surface seq_len in the printed table.
- Add an explicit warmup phase (_TIMING_WARMUP=3) before the timed loop
so the XPU runtime/JIT/caches are primed.
- Run more timed iterations (_TIMING_REPEATS=5), drop the slowest
sample, and report the arithmetic mean of the rest instead of the
median, for steadier numbers across runs.
- Update _bench_one/_format_row/_print_header and the docstring
accordingly; the FP, ARK, and GPTQModel benchmarks now each emit one
row per seq_len.