Fix CPU 16-byte alignment issue using equivalent fallback (#44970)
* fix 16-byte alignment issue
* add issue for reference
* test fix for non-aligned inputs as well
* avoid dims not divisible by 8 for grouped_mm testing
* test
* style
* final fix that works for cpu builds as well
* move comment