Move EOS tracking to CPU in _static_sample to avoid blocking device sync

Keep unfinished_sequences on the CPU and replace the stopping_criteria EOS
check with torch.isin(next_tokens.cpu(), eos_token_id_cpu). This swaps the
blocking max() reduction on the device for an async D2H copy followed by
CPU-only bookkeeping.
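
A minimal sketch of the CPU-side bookkeeping, for illustration only. The
helper name update_unfinished_cpu and the sample token ids are hypothetical
and are not taken from _static_sample; the only calls assumed are
torch.isin and Tensor.cpu.

```python
import torch


def update_unfinished_cpu(next_tokens, unfinished_cpu, eos_token_id_cpu):
    """Hypothetical helper: update per-sequence 'unfinished' flags on the host."""
    # Copy the sampled tokens to the host; everything below is CPU-only.
    next_tokens_cpu = next_tokens.cpu()
    # True where the sampled token is one of the EOS ids.
    hit_eos = torch.isin(next_tokens_cpu, eos_token_id_cpu)
    # A sequence stays unfinished only if it was unfinished and did not hit EOS.
    unfinished_cpu = unfinished_cpu & ~hit_eos
    # Host-side termination check; no reduction is issued on the device.
    all_done = not bool(unfinished_cpu.any())
    return unfinished_cpu, all_done


# Illustrative values (assumed): two EOS ids, a batch of three sequences.
eos_token_id_cpu = torch.tensor([2, 7])
unfinished = torch.ones(3, dtype=torch.bool)

# Step 1: only the second sequence samples an EOS id.
unfinished, done_step1 = update_unfinished_cpu(
    torch.tensor([5, 2, 9]), unfinished, eos_token_id_cpu
)
flags_step1 = unfinished.tolist()

# Step 2: the remaining two sequences sample EOS ids.
unfinished, done_step2 = update_unfinished_cpu(
    torch.tensor([7, 4, 2]), unfinished, eos_token_id_cpu
)
flags_step2 = unfinished.tolist()
```

In this sketch the flags live in a CPU bool tensor, so the loop-exit test
reads host memory instead of reducing a device tensor.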

No measurable speedup on CUDA (0% change on A10G with Llama-3.1-8B), but
this removes a pipeline stall that matters on Neuron/XLA (~40ms per sync).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>