Replace scatter_ with direct indexing for output buffer in _static_sample

The write position is already available as a Python int derived from
the loop index, so assign directly
(output_ids[:, prefill_len + i + 1] = next_tokens) instead of calling
scatter_ with tensor indices, which requires extra view/expand calls.
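
For reference, a minimal sketch of why the two forms should be
interchangeable here. The tensor sizes, the token values, and the exact
shape of the old scatter_ call are assumptions for illustration, not
copied from _static_sample:

```python
import torch

# Hypothetical sizes and values, chosen only for this illustration.
batch, prefill_len, max_new = 2, 3, 4
output_ids = torch.zeros(batch, prefill_len + max_new + 1, dtype=torch.long)
next_tokens = torch.tensor([11, 22])  # one sampled token per batch row
i = 1                                 # loop index (a plain Python int)
pos = prefill_len + i + 1             # write position, also a Python int

# scatter_ form: the column index must become a (batch, 1) int64 tensor,
# and the source must be reshaped to match it.
via_scatter = output_ids.clone()
index = torch.tensor(pos).view(1, 1).expand(batch, 1)
via_scatter.scatter_(1, index, next_tokens.view(-1, 1))

# Direct-indexing form: no index tensor and no reshaping.
via_index = output_ids.clone()
via_index[:, pos] = next_tokens

assert torch.equal(via_scatter, via_index)
```

Both forms write next_tokens into column pos of every row; the direct
form skips building the index tensor.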

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>