Move output_ids and EOS masking to CPU in _static_sample

Move the output buffer (output_ids) to CPU and perform all token
bookkeeping on CPU to avoid device-side ops that would trigger NEFF
recompilation on Neuron.

Changes:
- output_ids allocated on CPU, prompt copied via input_ids.cpu()
- current_ids (device-side input buffer) updated via .copy_() from CPU
- EOS masking done entirely on CPU (no unfinished_sequences.to(device))
- logits_processor receives full output_ids buffer (static shape)
- output_ids moved back to device before return
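
The flow listed above can be sketched roughly as follows. This is a minimal
illustration, not the actual _static_sample code: the function name
static_sample_sketch, the step_fn callback (standing in for the model forward
pass), and the pad_token_id handling are assumptions made for the example.
Greedy selection is assumed and the logits_processor call is omitted.

```python
import torch


def static_sample_sketch(step_fn, input_ids, max_new_tokens, eos_token_id, pad_token_id):
    """Hypothetical greedy loop with CPU-side bookkeeping.

    step_fn(current_ids, cur_len) is an assumed callback that returns
    next-token logits of shape [batch, vocab] for position cur_len.
    """
    device = input_ids.device
    batch, prompt_len = input_ids.shape
    total_len = prompt_len + max_new_tokens

    # Fixed-shape output buffer on CPU; the prompt is copied in once.
    output_ids = torch.full((batch, total_len), pad_token_id, dtype=torch.long)
    output_ids[:, :prompt_len] = input_ids.cpu()

    # Device-side input buffer, refreshed in place so its shape never changes.
    current_ids = torch.empty((batch, total_len), dtype=torch.long, device=device)

    # Per-sequence "still generating" flags, kept on CPU.
    unfinished = torch.ones(batch, dtype=torch.long)

    for cur_len in range(prompt_len, total_len):
        current_ids.copy_(output_ids)  # CPU -> device, no reallocation
        logits = step_fn(current_ids, cur_len)
        next_tokens = logits.argmax(dim=-1).cpu()

        # EOS masking on CPU: finished rows emit pad_token_id from now on.
        next_tokens = next_tokens * unfinished + pad_token_id * (1 - unfinished)
        output_ids[:, cur_len] = next_tokens
        unfinished = unfinished * (next_tokens != eos_token_id).long()
        if unfinished.max() == 0:
            break

    # Hand the result back on the caller's device.
    return output_ids.to(device)
```

In this sketch the only tensors the device sees are current_ids and the
logits, both with shapes fixed for the whole loop; everything whose value
depends on which sequences have finished stays on the host.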

On CUDA (A10G, Llama-3.1-8B, 256 tokens): +1.4% vs _sample baseline.
Sanity check: PASSED (identical greedy output).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>