HiAR

Hierarchical Autoregressive Video Generation with Pipelined Parallel Inference

arXiv | Website | Code | Model


HiAR proposes hierarchical denoising for autoregressive video diffusion models, shifting the denoising order from the conventional block-first scheme to a step-first one. Because each block is conditioned on context at a matched noise level, HiAR maximally attenuates error propagation while preserving temporal causality, achieving state-of-the-art long video generation (20s+) with significantly reduced quality drift.
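The difference between the two denoising orders can be illustrated with a small scheduling sketch. This is not the released implementation: the function names and the dependency rule are assumptions made for illustration. It assumes that the work unit `(block, step)` depends only on `(block - 1, step)` (the context at the matched noise level) and `(block, step - 1)` (the same block at the previous step), which is what would allow units on the same diagonal to run in parallel in a pipelined fashion.

```python
# Illustrative sketch only; names and the dependency rule are assumptions,
# not the HiAR codebase.

def block_first_order(num_blocks, num_steps):
    """Conventional order: fully denoise block 0, then block 1, and so on."""
    return [(b, s) for b in range(num_blocks) for s in range(num_steps)]


def step_first_order(num_blocks, num_steps):
    """Step-first order: run step 0 for every block, then step 1, and so on."""
    return [(b, s) for s in range(num_steps) for b in range(num_blocks)]


def pipelined_wavefronts(num_blocks, num_steps):
    """Group (block, step) units by block + step.

    Under the assumed dependency rule, every unit in a wavefront depends
    only on units in the previous wavefront, so the units inside one
    wavefront could be executed concurrently.
    """
    waves = []
    for t in range(num_blocks + num_steps - 1):
        wave = [(b, t - b) for b in range(num_blocks) if 0 <= t - b < num_steps]
        waves.append(wave)
    return waves


if __name__ == "__main__":
    print("block-first:", block_first_order(2, 2))
    print("step-first: ", step_first_order(2, 2))
    for i, wave in enumerate(pipelined_wavefronts(3, 2)):
        print(f"wavefront {i}: {wave}")
```

Under these assumptions, a sequential run needs `num_blocks * num_steps` model calls in a row, while the wavefront grouping needs only `num_blocks + num_steps - 1` sequential stages, given enough devices to run each wavefront in parallel.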

Discussion & Limitations

In essence, this method lets autoregressive video generation mimic the denoising paradigm of bidirectional-attention video models. For instance, the high-noise denoising stages require only coarse-grained context information. This design maximally reduces error accumulation while, in theory, retaining enough information to maintain continuity. By scaling the training budget under the constraint of the forward KL loss, we can achieve near-zero degradation in most scenarios, even enabling arbitrarily long generation (e.g., over 200 minutes). However, in some dynamic scenes, inter-frame jumps may still occur. We believe this is not an inherent limitation of the hierarchical denoising paradigm itself but a consequence of insufficient capacity in the 1.3B base model, since this denoising paradigm is considerably more challenging. We plan to validate the paradigm on more powerful base models in the future.

