DiffiT: Diffusion Vision Transformers for Image Generation

Paper GitHub

This repository hosts the pretrained model weights for DiffiT (ECCV 2024), a diffusion model built on Vision Transformers that achieves state-of-the-art image generation quality with improved parameter efficiency.

Overview

DiffiT (Diffusion Vision Transformers) is a generative model that combines the expressive power of diffusion models with Vision Transformers (ViTs). It introduces Time-dependent Multihead Self-Attention (TMSA), which gives the denoising network fine-grained, timestep-aware control at every stage of the diffusion process. DiffiT achieves state-of-the-art performance on class-conditional ImageNet generation at multiple resolutions, notably attaining an FID score of 1.73 on ImageNet-256 while using 19.85% and 16.88% fewer parameters than the comparable Transformer-based diffusion models MDT and DiT, respectively.
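The core idea of TMSA is that the query, key, and value projections depend on both the spatial tokens and a time embedding, so the attention pattern itself shifts with the diffusion timestep. The following is a minimal single-head NumPy sketch of that idea; the weight and function names are illustrative, not DiffiT's actual module API (see the paper and repository for the real implementation).

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def tmsa_single_head(x, t_emb, params):
    """Illustrative single-head time-dependent self-attention.

    x      : (n, d) spatial tokens
    t_emb  : (d,)   shared time-embedding token for the current timestep
    params : six (d, d) weight matrices (spatial and temporal
             projections for q, k, v) -- names are hypothetical
    """
    Wqs, Wqt, Wks, Wkt, Wvs, Wvt = params
    d = x.shape[-1]
    # each projection mixes spatial content with the timestep embedding,
    # so attention weights change as denoising progresses
    q = x @ Wqs + t_emb @ Wqt
    k = x @ Wks + t_emb @ Wkt
    v = x @ Wvs + t_emb @ Wvt
    attn = softmax(q @ k.T / np.sqrt(d))
    return attn @ v
```

Because `t_emb @ Wqt` is broadcast across all tokens, changing the timestep shifts every query and key, altering the whole attention map rather than just biasing the values.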


Models

ImageNet-256

| Model  | Dataset  | Resolution | FID-50K | Inception Score | Download |
|--------|----------|------------|---------|-----------------|----------|
| DiffiT | ImageNet | 256×256    | 1.73    | 276.49          | model    |

ImageNet-512

| Model  | Dataset  | Resolution | FID-50K | Inception Score | Download |
|--------|----------|------------|---------|-----------------|----------|
| DiffiT | ImageNet | 512×512    | 2.67    | 252.12          | model    |

Usage

Please refer to the official GitHub repository for full setup instructions, training code, and evaluation scripts.

Sampling Images

Image sampling is performed using sample.py from the DiffiT repository. To reproduce the reported numbers, use the commands below.

ImageNet-256:

python sample.py \
    --log_dir $LOG_DIR \
    --cfg_scale 4.4 \
    --model_path $MODEL \
    --image_size 256 \
    --model Diffit \
    --num_sampling_steps 250 \
    --num_samples 50000 \
    --cfg_cond True

ImageNet-512:

python sample.py \
    --log_dir $LOG_DIR \
    --cfg_scale 1.49 \
    --model_path $MODEL \
    --image_size 512 \
    --model Diffit \
    --num_sampling_steps 250 \
    --num_samples 50000 \
    --cfg_cond True
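The --cfg_scale flag (4.4 at 256×256, 1.49 at 512×512) controls classifier-free guidance strength. As a point of reference, the standard CFG rule combines the unconditional and class-conditional noise predictions at each sampling step as sketched below; this is the generic formulation, not DiffiT's exact sampler code.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, cfg_scale):
    # classifier-free guidance: extrapolate from the unconditional
    # prediction toward the conditional one; cfg_scale = 1.0 recovers
    # the purely conditional prediction, larger values sharpen
    # class adherence at some cost to sample diversity
    return eps_uncond + cfg_scale * (eps_cond - eps_uncond)
```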

Evaluation

Once images have been sampled, you can compute FID and other metrics using the provided eval_run.sh script in the repository. The evaluation pipeline follows the protocol from openai/guided-diffusion/evaluations.

bash eval_run.sh
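FID compares the mean and covariance of Inception features between the 50K generated samples and the reference set. For orientation, here is a minimal NumPy sketch of the Fréchet distance itself, computed from precomputed feature statistics; the actual pipeline should use the guided-diffusion evaluation code referenced above, which also handles feature extraction.

```python
import numpy as np

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """FID between two Gaussians: ||mu1 - mu2||^2
    + Tr(sigma1 + sigma2 - 2 (sigma1 sigma2)^{1/2})."""
    def sqrtm_psd(a):
        # matrix square root of a symmetric PSD matrix via eigendecomposition
        w, v = np.linalg.eigh(a)
        return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

    # Tr((sigma1 sigma2)^{1/2}) computed via the symmetric
    # similar matrix sigma2^{1/2} sigma1 sigma2^{1/2}
    s2h = sqrtm_psd(sigma2)
    covmean = sqrtm_psd(s2h @ sigma1 @ s2h)
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2)
                 - 2.0 * np.trace(covmean))
```

Identical statistics give a distance of zero; lower is better.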

Citation

@inproceedings{hatamizadeh2025diffit,
  title={Diffit: Diffusion vision transformers for image generation},
  author={Hatamizadeh, Ali and Song, Jiaming and Liu, Guilin and Kautz, Jan and Vahdat, Arash},
  booktitle={European Conference on Computer Vision},
  pages={37--55},
  year={2025},
  organization={Springer}
}

License

Copyright © 2026, NVIDIA Corporation. All rights reserved.

The code is released under the NVIDIA Source Code License-NC. The pretrained models are shared under CC-BY-NC-SA-4.0. If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
