Representation Alignment for Just Image Transformers is not Easier than You Think
Abstract
Representation alignment fails for pixel-space diffusion transformers due to information asymmetry, but PixelREPA addresses this by transforming alignment targets and using masked transformer adapters to improve training convergence and image quality.
Representation Alignment (REPA) has emerged as a simple way to accelerate Diffusion Transformers training in latent space. At the same time, pixel-space diffusion transformers such as Just image Transformers (JiT) have attracted growing attention because they remove a dependency on a pretrained tokenizer, and then avoid the reconstruction bottleneck of latent diffusion. This paper shows that the REPA can fail for JiT. REPA yields worse FID for JiT as training proceeds and collapses diversity on image subsets that are tightly clustered in the representation space of pretrained semantic encoder on ImageNet. We trace the failure to an information asymmetry: denoising occurs in the high dimensional image space, while the semantic target is strongly compressed, making direct regression a shortcut objective. We propose PixelREPA, which transforms the alignment target and constrains alignment with a Masked Transformer Adapter that combines a shallow transformer adapter with partial token masking. PixelREPA improves both training convergence and final quality. PixelREPA reduces FID from 3.66 to 3.17 for JiT-B/16 and improves Inception Score (IS) from 275.1 to 284.6 on ImageNet 256 times 256, while achieving > 2times faster convergence. Finally, PixelREPA-H/16 achieves FID=1.81 and IS=317.2. Our code is available at https://github.com/kaist-cvml/PixelREPA.
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- RPiAE: A Representation-Pivoted Autoencoder Enhancing Both Image Generation and Editing (2026)
- V-Co: A Closer Look at Visual Representation Alignment via Co-Denoising (2026)
- End-to-End Training for Unified Tokenization and Latent Denoising (2026)
- DA-VAE: Plug-in Latent Compression for Diffusion via Detail Alignment (2026)
- WiT: Waypoint Diffusion Transformers via Trajectory Conflict Navigation (2026)
- PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss (2026)
- DiT-IC: Aligned Diffusion Transformer for Efficient Image Compression (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
Get this paper in your agent:
hf papers read 2603.14366 Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper