[Mosaic GPU] Add conversion logic for `i4 -> f8e4m3fn`.

The inline PTX upcasts 4 values at a time, so we use a generator to pack
several registers together when a single register doesn't hold enough packed
values. This makes the generated PTX smaller, since the conversion routine
is called fewer times.
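
As a rough illustration of the packing idea (not the actual Mosaic GPU code),
the sketch below groups register contents into chunks of 4 values, so that a
conversion routine would run once per chunk instead of once per register. The
names `group_registers` and `VALUES_PER_CALL` are hypothetical, and registers
are modeled as plain Python lists rather than MLIR vector values.

```python
from typing import Iterator, List, Sequence

# Assumption taken from the commit message: one conversion call handles 4 values.
VALUES_PER_CALL = 4


def group_registers(
    registers: Sequence[Sequence[int]],
    values_per_call: int = VALUES_PER_CALL,
) -> Iterator[List[int]]:
    """Yield chunks of exactly `values_per_call` values.

    Registers holding fewer values than one call needs are combined with the
    registers that follow; registers holding more are split across chunks.
    """
    pending: List[int] = []
    for reg in registers:
        pending.extend(reg)
        # Emit as many full chunks as the buffered values allow.
        while len(pending) >= values_per_call:
            yield pending[:values_per_call]
            pending = pending[values_per_call:]
    if pending:
        # Hypothetical policy for this sketch: reject a trailing partial chunk.
        raise ValueError("total value count is not a multiple of values_per_call")


# Four registers of 2 values each need only 2 conversion calls instead of 4.
chunks = list(group_registers([[0, 1], [2, 3], [4, 5], [6, 7]]))
```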

PiperOrigin-RevId: 770640667