xla
Use f32 scratch for output so we only need to transfer output with desired dtype back to HBM.
#8924
Merged
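
The idea in the title can be illustrated with a minimal sketch. This is plain Python, not the kernel code from this PR: the function names are hypothetical, IEEE half precision (`struct` format `"e"`) stands in for whatever narrow output dtype is requested, and rounding through `struct` format `"f"` stands in for the f32 scratch buffer. It shows partial results being accumulated at f32 precision and cast once at the end, so only the narrow-dtype value would need to be written back.

```python
import struct


def to_f32(x: float) -> float:
    """Round to IEEE single precision (stand-in for the f32 scratch dtype)."""
    return struct.unpack("f", struct.pack("f", x))[0]


def to_f16(x: float) -> float:
    """Round to IEEE half precision (assumed stand-in for the narrow output dtype)."""
    return struct.unpack("e", struct.pack("e", x))[0]


def accumulate_in_output_dtype(partials):
    # Hypothetical baseline: the running sum is rounded to the narrow
    # dtype after every step, so small addends are eventually lost.
    acc = 0.0
    for p in partials:
        acc = to_f16(acc + p)
    return acc


def accumulate_in_f32_scratch(partials):
    # Hypothetical scratch variant: the running sum stays at f32 precision
    # and is cast to the narrow dtype exactly once, at the end.
    scratch = 0.0
    for p in partials:
        scratch = to_f32(scratch + p)
    return to_f16(scratch)


partials = [0.001] * 4096  # the exact sum is 4.096
narrow = accumulate_in_output_dtype(partials)
scratch = accumulate_in_f32_scratch(partials)

# Bytes per element for each dtype: the final write-back of a
# half-precision result is half the size of an f32 one.
bytes_f16 = struct.calcsize("e")
bytes_f32 = struct.calcsize("f")
```

In this sketch the scratch variant stays close to 4.096, while the narrow-dtype accumulation drifts away from it, and the value that leaves the accumulator is still only `bytes_f16` wide.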
