[AMDGPU] Enable scheduler mfma rewrite stage by default (#180751)
Performance testing showed that the cost of the many copies inserted
outside the loop is more than offset by better register allocation
within the loop as a result of the rewrite. This change also includes a
minor cleanup of the cost logic.
---------
Co-authored-by: Tony Linthicum <tlinthic@gmail.com>