Fix FP8 padding: use torch.where instead of masked_fill_
masked_fill_ and indexed assignment are not implemented for
float8_e4m3fn. Use torch.where instead, which supports FP8 tensors.
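
A minimal sketch of the pattern, not the actual change in this commit. The helper name pad_rows, its parameters, and the 2-D row-padding layout are hypothetical. It shows padding done out of place with torch.where rather than with masked_fill_ or indexed assignment. Which ops accept float8_e4m3fn can vary with PyTorch version and device, so the fill value is created in float32 and cast to the input dtype.

```python
import torch


def pad_rows(x: torch.Tensor, valid_rows: int, pad_value: float = 0.0) -> torch.Tensor:
    """Return a copy of 2-D `x` with rows at index >= valid_rows set to pad_value.

    Hypothetical helper for illustration only.
    """
    # Boolean mask of shape (rows, 1); True marks the padding rows.
    row_ids = torch.arange(x.shape[0], device=x.device)
    is_pad = (row_ids >= valid_rows).unsqueeze(-1)

    # Build the scalar fill in float32, then cast it to x's dtype, so no
    # in-place fill is issued on a tensor of the target dtype.
    fill = torch.tensor(pad_value, dtype=torch.float32, device=x.device).to(x.dtype)

    # Out-of-place select: take `fill` where is_pad is True, `x` elsewhere.
    return torch.where(is_pad, fill, x)


# Usage with float32 for illustration; the call shape is the same for
# other dtypes that torch.where supports.
x = torch.ones(4, 3)
padded = pad_rows(x, valid_rows=2)
```

Because torch.where returns a new tensor, callers that relied on in-place mutation need to rebind the result.
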
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>