refactor(ark): drop INT8 asym DPAS; add INT4/INT2 sym via INT8 DPAS

Roll back the INT8 asym DPAS path: on hardware it regressed perf
relative to the dequant fallback.

Add INT4-sym and INT2-sym prefill paths. Both upcast the packed weights
into an int8_t [E, N, K] view inside the existing dequant workspace,
then dispatch through the same per-group INT8 DPAS mainloop that the
S8-sym branch uses. The packed scale tensor is reused unmodified.
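
For reviewers, a minimal reference sketch of why the scales can be
reused as-is: sign-extending a packed field to int8 does not change its
integer value, so the per-group scale applies unchanged. This is not the
kernel code. The helper names (unpack_sym, per_group_dot) are
hypothetical, and the layout (two's-complement fields, lowest bits of
each byte first) is an assumption that may not match the actual packing.

```python
def unpack_sym(packed: bytes, bits: int) -> list[int]:
    """Sign-extend packed `bits`-wide fields into int8-range integers.

    Assumed layout (not confirmed by the commit): two's-complement
    fields, lowest-order bits of each byte hold the first element.
    """
    if bits not in (2, 4):
        raise ValueError("only INT4 and INT2 are sketched here")
    per_byte = 8 // bits
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    out = []
    for byte in packed:
        for i in range(per_byte):
            field = (byte >> (i * bits)) & mask
            # XOR/subtract trick: maps the unsigned field to its signed value.
            out.append((field ^ sign) - sign)
    return out


def per_group_dot(acts, weights, scales, group_size):
    """Reference per-group dot product along K.

    Each group accumulates in integers, then is scaled once. The scale
    list is indexed per group and is independent of the weight bit width.
    """
    total = 0.0
    for g, scale in enumerate(scales):
        lo = g * group_size
        hi = lo + group_size
        acc = sum(a * w for a, w in zip(acts[lo:hi], weights[lo:hi]))
        total += acc * scale
    return total
```

Under these assumptions, unpack_sym(packed, 4) followed by
per_group_dot(...) gives the same result as dequantizing the INT4
weights first, which is the property the shared mainloop relies on.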