diffusers
d94ad81b - [AnyFlow] FAR: standalone causal-mask builder + torch.compile follow-up (#13792)

Commit
35 days ago
[AnyFlow] FAR: standalone causal-mask builder + torch.compile follow-up (#13792)

* [AnyFlow] FAR: standalone causal-mask builder + torch.compile follow-up

  Follow-up to #13745. Extracts FAR mask construction to a module-level helper and adds an `attention_mask` forward kwarg so AnyFlowFARTransformer3DModel can be wrapped in `torch.compile(fullgraph=True)`. The pipeline pre-builds the mask during KV-cache prefill so users get end-to-end fullgraph compile.

  * Public method `AnyFlowFARTransformer3DModel.build_attention_mask(...)` (modes: "train", "cache") plus private module-level helper `_build_anyflow_far_causal_block_mask(...)`.
  * `_build_freqs` cache lookup/write bypassed under `torch.compiler.is_compiling()` to avoid a Dynamo guard recompile on the second compiled call (applied in bidi source; synced to FAR via `# Copied from`).
  * `TestAnyFlowFARTransformer3DCompile(TorchCompileTesterMixin)`: recompilation_and_graph_break, repeated_blocks, and group_offloading pass on H200; AOT is `@pytest.mark.skip`'d (torch.export rejects BlockMask as a pytree input).
  * Base `get_dummy_inputs` omits `attention_mask` so every non-compile test class exercises the in-forward fallback; the compile class overrides to inject a pre-built mask.
  * Bit-exact: pre-built path vs internal-build fallback max|Δ|=0.0e+00.

* [AnyFlow] docs: full author list, repo demo examples, slimmer pipeline page

  * Full author list and NVIDIA → NUS → MIT institution order; TL;DR + abstract + Available Models bullets.
  * Rewritten pipeline-selection tip describing both pipelines symmetrically.
  * T2V / I2V / V2V examples now use the canonical 81-frame setup and the demo prompts / conditioning assets shipped under `NVlabs/AnyFlow/assets/evaluation/` (linked via raw.githubusercontent.com).
  * Drop the inline "Optimizing Memory" and "torch.compile" sections; those notes will live in the NVlabs/AnyFlow repo's own performance guide rather than the diffusers pipeline reference.
  * Sync zh user guide and the two model-API stubs.

* [AnyFlow] FAR: move chunk_partition default into transformer config

  - AnyFlowFARTransformer3DModel.__init__ now accepts chunk_partition via @register_to_config (default (1, 3, 3, 3, 3, 3, 3, 2) for the released 81-frame checkpoints, matching the field on Hub).
  - AnyFlowFARPipeline.__call__ no longer requires chunk_partition; defaults to self.transformer.config.chunk_partition. Per-call override still supported for V2V / non-default num_frames.
  - Drop the AnyFlowFARPipeline.default_chunk_partition class attribute.
  - Update docs (en pipelines/models, zh using-diffusers) and the conversion script to match.

* [AnyFlow] FAR pipeline: fix `timesteps` shadowing across chunks

  Inside the per-chunk rollout loop, the local variable `timesteps` was reassigned to `self.scheduler.timesteps` after `set_timesteps()`. On the next chunk iteration the same name was passed back into `set_timesteps(timesteps=...)`, where a non-None value enters the *custom-schedule* branch: `apply_shift` re-runs on already-shifted values, double-shifting the schedule for every chunk after the first.

  Concretely, with `shift=5` and `num_inference_steps=4`:

  - chunk 0 timesteps: [1000, 937.5, 833.3, 625] (correct)
  - chunk 1+ timesteps: [1000, 986.8, 961.3, 892.9] (double-shifted)

  The later steps drift toward `t=1000` instead of toward `t=0`, the flow-map model is conditioned on the wrong source sigma, and the chunk KV cache accumulates errors that show up as artifacts in later video frames.

  Fix: rebind the cached schedule to a fresh local name (`scheduler_timesteps`) so the outer-scope `timesteps` kwarg (the user-provided custom schedule, when any) stays untouched across chunks.
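The shadowing bug described above can be reproduced in isolation. The sketch below is a toy model, not the diffusers scheduler: `ToyScheduler`, `rollout`, and the `fixed` flag are hypothetical names, and the shift formula `s * sigma / (1 + (s - 1) * sigma)` is an assumption (the common flow-matching time shift). Under that assumption it reproduces the chunk 0 schedule quoted above and lands close to the quoted double-shifted one.

```python
def apply_shift(sigmas, shift):
    """Assumed time shift on sigmas in [0, 1]: s * sigma / (1 + (s - 1) * sigma)."""
    return [shift * s / (1 + (shift - 1) * s) for s in sigmas]


class ToyScheduler:
    """Hypothetical stand-in for the real scheduler; it only mimics the re-shift hazard."""

    def __init__(self, shift):
        self.shift = shift
        self.timesteps = None

    def set_timesteps(self, num_inference_steps=None, timesteps=None):
        if timesteps is None:
            # Default branch: linearly spaced sigmas 1, 1 - 1/N, ..., 1/N.
            n = num_inference_steps
            sigmas = [1 - i / n for i in range(n)]
        else:
            # Custom-schedule branch: caller values are treated as unshifted.
            sigmas = [t / 1000 for t in timesteps]
        self.timesteps = [1000 * s for s in apply_shift(sigmas, self.shift)]


def rollout(num_chunks, fixed, timesteps=None):
    """Return the rounded schedule used by each chunk."""
    scheduler = ToyScheduler(shift=5.0)
    per_chunk = []
    for _ in range(num_chunks):
        scheduler.set_timesteps(num_inference_steps=4, timesteps=timesteps)
        if fixed:
            # Fresh local name: the `timesteps` kwarg stays untouched.
            scheduler_timesteps = scheduler.timesteps
        else:
            # Bug: the kwarg is overwritten and fed back in on the next chunk.
            timesteps = scheduler.timesteps
            scheduler_timesteps = timesteps
        per_chunk.append([round(t, 1) for t in scheduler_timesteps])
    return per_chunk


buggy = rollout(num_chunks=2, fixed=False)
fixed = rollout(num_chunks=2, fixed=True)
```

With these assumptions, `fixed` yields `[1000.0, 937.5, 833.3, 625.0]` for both chunks, while `buggy[1]` comes out as `[1000.0, 986.8, 961.5, 892.9]`, drifting toward `t=1000` as described.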
  Layer-by-layer verification against the NVlabs reference implementation on H200 (elephant prompt, seed 0, 4 NFE, 81 frames):

  - chunk 0 inference: bit-exact (0.0 mean diff)
  - chunk 1 step 0: 0.194 → 0.014 (-93%)
  - chunk 7 last step: 0.564 → 0.274 (-51%)

* [AnyFlow] FAR: doc-builder line wrap for chunk_partition docstrings

  Pure rewrap to satisfy `doc-builder style --max_len 119`. Two docstrings introduced in 96077b2 (the `chunk_partition` config arg on the FAR transformer + the matching pipeline kwarg) wrapped a few characters short of the line budget. No semantic change.

* [AnyFlow] docs: drop author names from docstrings, link FAR via HF papers, say chunk-wise

  - Remove author-name attributions from the transformer / pipeline class docstrings and file-header comments; the paper-citation header on the doc page keeps the full author list, the in-code references just point at the [AnyFlow] / [FAR] papers.
  - Link FAR via its Hugging Face papers page (https://huggingface.co/papers/2503.19325) instead of a raw arxiv.org URL, matching the AnyFlow reference style and the rest of the diffusers docs.
  - Describe AnyFlow FAR generation as "chunk-wise autoregressive": the pipeline autoregresses over chunks (`chunk_partition`), not single frames.

* [AnyFlow] FAR: address review nits

  - pipeline: reuse the standard `timesteps` variable name for the per-chunk scheduler timesteps; freeze the caller-provided custom schedule in `custom_timesteps`/`custom_sigmas` before the loop so it isn't re-fed into `set_timesteps` and double-shifted on later chunks.
  - transformer: clarify the no-mask fallback comment to spell out the `torch.compile(fullgraph=True)` graph-break behavior and the `build_attention_mask` workaround.

---------

Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
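To illustrate what "chunk-wise autoregressive" means for attention, here is a minimal frame-level sketch of a chunk-causal mask derived from `chunk_partition`. It is not the `_build_anyflow_far_causal_block_mask` helper from this commit: the real helper is described as producing a `BlockMask` with "train" and "cache" modes, none of which is modeled here. The function names `resolve_chunk_partition`, `chunk_ids`, and `chunk_causal_mask` are hypothetical, and the rule that a frame may attend to its own chunk and all earlier chunks is an assumption about the masking pattern.

```python
# Default partition quoted in the commit message for the released checkpoints.
DEFAULT_CHUNK_PARTITION = (1, 3, 3, 3, 3, 3, 3, 2)


def resolve_chunk_partition(config_partition, override=None):
    """A per-call override wins; otherwise fall back to the config default."""
    return tuple(override) if override is not None else tuple(config_partition)


def chunk_ids(partition):
    """Map each frame index to the index of the chunk that contains it."""
    return [chunk for chunk, size in enumerate(partition) for _ in range(size)]


def chunk_causal_mask(partition):
    """mask[q][k] is True when query frame q may attend to key frame k.

    Assumed rule: attention is allowed within a chunk and to earlier chunks.
    """
    ids = chunk_ids(partition)
    return [[k_chunk <= q_chunk for k_chunk in ids] for q_chunk in ids]


partition = resolve_chunk_partition(DEFAULT_CHUNK_PARTITION)
mask = chunk_causal_mask(partition)
```

With the default partition the mask is 21 by 21: frame 0 sees only itself, frames 1 to 3 see each other plus frame 0, and the last frame sees everything. Passing `override=(1, 2)` to `resolve_chunk_partition` mirrors the per-call override mentioned for V2V or non-default `num_frames`.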