Replace .item() device-host sync with loop index in _static_sample

Calling next_pos.item() on every decode step forces a device-to-host
synchronization just to convert a device tensor to a Python int.
Derive cur_len from the loop index instead:
cur_len = prefill_len + i + 2. This is pure Python arithmetic and
needs no sync. Critical on Neuron (~40ms per sync), minor on CUDA.
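
Illustrative sketch only, not the actual _static_sample code. It
assumes next_pos starts at prefill_len, advances by one before each
read, and that cur_len = next_pos + 1; those assumptions are what make
the two forms agree. FakeDeviceScalar and both helper functions are
hypothetical stand-ins so the sketch runs without a device.

```python
class FakeDeviceScalar:
    """Hypothetical stand-in for a device tensor; counts .item() calls."""

    syncs = 0  # number of simulated device-to-host syncs

    def __init__(self, value):
        self._value = value

    def __add__(self, other):
        # Stays "on device": no sync is counted for arithmetic.
        return FakeDeviceScalar(self._value + other)

    def item(self):
        # Each .item() call models one device-to-host sync.
        FakeDeviceScalar.syncs += 1
        return self._value


def cur_lens_with_item(prefill_len, num_steps):
    """Old pattern: read the position back from the device every step."""
    next_pos = FakeDeviceScalar(prefill_len)  # assumed starting value
    lens = []
    for _ in range(num_steps):
        next_pos = next_pos + 1
        lens.append(next_pos.item() + 1)  # one sync per decode step
    return lens


def cur_lens_with_index(prefill_len, num_steps):
    """New pattern: derive cur_len from the loop index, no sync."""
    return [prefill_len + i + 2 for i in range(num_steps)]
```

Under these assumptions both helpers return the same lengths, but only
the first one pays a sync per step.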

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>