Fix fp8 gemm (#7265)
This PR addresses the issue reported in
https://github.com/deepspeedai/DeepSpeed/issues/7236.
This change may have reverted some of the recent changes introduced in
[PR #6932](https://github.com/deepspeedai/DeepSpeed/pull/6932); doing so
was necessary to remove a misaligned-address issue in the CUDA kernel. I
will come back to this and try to make the necessary changes for the
other pass.
cc: @mrwyattii @jeffra
---------
Co-authored-by: Reza Yazdani <reza.yazdani@snowflake.com>
Co-authored-by: Reza Yazdani <rezay@microsoft.com>
Co-authored-by: Jeff Rasley <jeffra45@gmail.com>
Co-authored-by: Michael Wyatt <michael.wyatt@snowflake.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>