[LLADA2] Fix llada2 review #13598 (#13698)
* [LLaDA2] address review findings from #13598
Fixes the six in-scope issues raised in the llada2 model/pipeline review:
1. Carry tokenizer `attention_mask` through `_prepare_input_ids` and add an
`attention_mask` arg to `__call__` for pre-tokenized inputs. The runtime
mask now reflects prompt padding and zeros out the block-aligned tail
past `prompt_length + gen_length` instead of treating those positions
as valid context.
2. Thread the per-call `block_length` into `BlockRefinementScheduler.set_timesteps`
so the transfer schedule matches the requested block size (previously the
scheduler only read its constructor default).
3. Drop `x0`/`x0_p`/`confidence` from `_callback_tensor_inputs` (they were
never bound as locals) and bind `sampled_tokens`, `sampled_probs`,
`editing_transfer_index`, and `active_block` so all advertised callback
keys resolve.
4. Allow EOS exactly at index `prompt_length` (the first generated position)
to mark a row finished.
5. Freeze rows that have already emitted EOS so subsequent block refinement
doesn't extend them, and trim per row at decode time (previously gated on
`batch_size == 1`) so post-EOS positions don't leak into decoded text.
6. Stop calling `self.set_progress_bar_config(...)` from inside `__call__`;
build a local config dict for the inner block bar so user-supplied flags
(in particular `disable=True`) survive the call.
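
As a rough illustration of fix 1, the runtime mask for one row can be sketched as below. This is not the pipeline's code: `build_runtime_mask` is a hypothetical helper, plain lists stand in for tensors, and left padding in the example is an assumption.

```python
import math


def build_runtime_mask(prompt_mask, gen_length, block_length):
    """Return a 0/1 mask over the block-aligned sequence for one row.

    prompt_mask: tokenizer attention mask for the (possibly padded) prompt.
    Hypothetical helper for illustration only.
    """
    prompt_length = len(prompt_mask)
    valid_length = prompt_length + gen_length
    # Round the sequence up to a whole number of blocks.
    total_length = math.ceil(valid_length / block_length) * block_length
    tail = total_length - valid_length
    # Keep prompt padding as-is, mark generated positions valid,
    # and zero out the block-aligned tail.
    return list(prompt_mask) + [1] * gen_length + [0] * tail


# One left-padded prompt of length 3, two generated tokens, blocks of 4.
mask = build_runtime_mask([0, 1, 1], gen_length=2, block_length=4)
print(mask)
```

The point of the sketch is only that positions past `prompt_length + gen_length` end up masked out rather than being treated as valid context.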
Adds regression tests pinning each of the six fixes.
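
The behavior described in fixes 4 and 5 can be approximated with the sketch below. It is an assumption-laden illustration, not the pipeline's implementation: `is_finished` and `trim_at_eos` are hypothetical names, and token rows are plain lists of ints.

```python
def is_finished(row, prompt_length, eos_id):
    """A row is finished if EOS appears at or after prompt_length.

    Index prompt_length itself (the first generated position) counts.
    """
    return eos_id in row[prompt_length:]


def trim_at_eos(row, prompt_length, eos_id):
    """Return the generated tokens of one row, cut before the first EOS."""
    generated = row[prompt_length:]
    for i, tok in enumerate(generated):
        if tok == eos_id:
            return generated[:i]
    return generated


# Trim every row of the batch, not only when the batch has a single row.
batch = [[5, 6, 2, 9], [5, 6, 7, 2, 9], [2, 6, 7, 8]]
trimmed = [trim_at_eos(row, prompt_length=2, eos_id=2) for row in batch]
print(trimmed)
```

Note that an EOS id inside the prompt (third row) is ignored, while an EOS at exactly `prompt_length` (first row) ends the row immediately.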
* fix formatting
* undo changes
* make `block_length` optional and fall back to the scheduler's default
---------
Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>