metadata
license: other
license_name: nscl-v1
license_link: LICENSE
tags:
  - image-generation
  - class-conditional
  - diffusion
  - pixel-space
  - dit
  - imagenet
library_name: pytorch
pipeline_tag: unconditional-image-generation

PixelDiT: Pixel Diffusion Transformers for Image Generation

Yongsheng Yu1,2   Wei Xiong1†   Weili Nie1   Yichen Sheng1   Shiqiu Liu1   Jiebo Luo2

1NVIDIA   2University of Rochester
† Project Lead and Main Advising

   

Model Overview

PixelDiT-XL (797M parameters) is a class-conditional image generation model trained on ImageNet. It operates directly in pixel space, with no VAE and no latent space. Its dual-level architecture combines a patch-level DiT for global semantics with a pixel-level DiT for fine texture details.

Pre-trained Checkpoints

| Checkpoint | Resolution | Epochs | gFID | CFG Scale | Time Shift | CFG Interval |
|---|---|---|---|---|---|---|
| imagenet256_pixeldit_xl_epoch80.ckpt | 256x256 | 80 | 2.36 | 3.25 | 1.0 | [0.1, 1.0] |
| imagenet256_pixeldit_xl_epoch160.ckpt | 256x256 | 160 | 1.97 | 3.25 | 1.0 | [0.1, 1.0] |
| imagenet256_pixeldit_xl_epoch320.ckpt | 256x256 | 320 | 1.61 | 2.75 | 1.0 | [0.1, 0.9] |
| imagenet512_pixeldit_xl.ckpt | 512x512 | 850 | 1.78 | 3.5 | 2.0 | [0.1, 1.0] |

All evaluations use FlowDPMSolver with 100 sampling steps and 50K generated samples. Metrics follow the ADM evaluation protocol.
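
The CFG Scale and CFG Interval columns can be read as interval-limited classifier-free guidance: guidance is applied only while the sampling time lies inside the interval. The snippet below is a minimal sketch of that idea, not the code inside `FlowDPMSolverSampler`. The function name `guided_prediction`, the inclusive bounds, and the direction in which `t` runs are all assumptions made for illustration.

```python
def guided_prediction(cond, uncond, t, scale, interval=(0.1, 1.0)):
    """Classifier-free guidance restricted to a time interval (illustrative).

    cond / uncond: model outputs with and without the class label,
    given here as plain lists of floats to keep the sketch dependency-free.
    """
    lo, hi = interval
    if not (lo <= t <= hi):
        # Outside the interval: use the plain conditional prediction.
        return list(cond)
    # Inside the interval: extrapolate away from the unconditional prediction.
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Settings of the epoch-320 checkpoint: scale 2.75, interval [0.1, 0.9].
print(guided_prediction([1.0], [0.0], t=0.5, scale=2.75, interval=(0.1, 0.9)))   # [2.75]
print(guided_prediction([1.0], [0.0], t=0.95, scale=2.75, interval=(0.1, 0.9)))  # [1.0]
```

With an interval of [0.1, 1.0], guidance is skipped only for times below 0.1. The [0.1, 0.9] setting additionally skips it above 0.9.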

Usage

Installation

pip install torch torchvision lightning omegaconf timm wandb h5py

Evaluation (Generate 50K Samples)

cd c2i/

# ImageNet 256x256 (epoch 320, best FID)
torchrun --nproc_per_node=8 main.py predict \
  -c configs/pix256_xl.yaml \
  --ckpt_path=imagenet256_pixeldit_xl_epoch320.ckpt \
  --model.diffusion_sampler.class_path=src.diffusion.FlowDPMSolverSampler \
  --model.diffusion_sampler.init_args.num_steps=100 \
  --model.diffusion_sampler.init_args.guidance=2.75 \
  --model.diffusion_sampler.init_args.timeshift=1.0 \
  --model.diffusion_sampler.init_args.guidance_interval_min=0.1 \
  --model.diffusion_sampler.init_args.guidance_interval_max=0.9 \
  --per_run_seed=false --seed_everything=1000

# ImageNet 512x512
torchrun --nproc_per_node=8 main.py predict \
  -c configs/pix512_xl.yaml \
  --ckpt_path=imagenet512_pixeldit_xl.ckpt \
  --model.diffusion_sampler.class_path=src.diffusion.FlowDPMSolverSampler \
  --model.diffusion_sampler.init_args.num_steps=100 \
  --model.diffusion_sampler.init_args.guidance=3.5 \
  --model.diffusion_sampler.init_args.timeshift=2.0 \
  --model.diffusion_sampler.init_args.guidance_interval_min=0.1 \
  --model.diffusion_sampler.init_args.guidance_interval_max=1.0 \
  --per_run_seed=false --seed_everything=10000

After generating samples, compute FID with the ADM evaluation toolkit.
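
The two evaluation commands differ only in checkpoint, config file, sampler settings and seed, so a small helper can assemble them from the checkpoint table. The following is a sketch under stated assumptions: `CHECKPOINTS` and `build_predict_command` are hypothetical names, the flags are copied from the commands above, and using `configs/pix256_xl.yaml` with seed 1000 for the epoch-80 and epoch-160 checkpoints is assumed rather than documented.

```python
import shlex

# Sampler settings copied from the checkpoint table.
# ASSUMPTION: the epoch-80 and epoch-160 checkpoints use the same config file
# and seed as the documented epoch-320 command.
CHECKPOINTS = {
    "imagenet256_pixeldit_xl_epoch80.ckpt": dict(
        config="configs/pix256_xl.yaml", guidance=3.25, timeshift=1.0,
        interval=(0.1, 1.0), seed=1000),
    "imagenet256_pixeldit_xl_epoch160.ckpt": dict(
        config="configs/pix256_xl.yaml", guidance=3.25, timeshift=1.0,
        interval=(0.1, 1.0), seed=1000),
    "imagenet256_pixeldit_xl_epoch320.ckpt": dict(
        config="configs/pix256_xl.yaml", guidance=2.75, timeshift=1.0,
        interval=(0.1, 0.9), seed=1000),
    "imagenet512_pixeldit_xl.ckpt": dict(
        config="configs/pix512_xl.yaml", guidance=3.5, timeshift=2.0,
        interval=(0.1, 1.0), seed=10000),
}


def build_predict_command(ckpt, num_gpus=8, num_steps=100):
    """Return the torchrun command line for one checkpoint as a string."""
    s = CHECKPOINTS[ckpt]
    sampler = "--model.diffusion_sampler"
    args = [
        "torchrun", f"--nproc_per_node={num_gpus}", "main.py", "predict",
        "-c", s["config"],
        f"--ckpt_path={ckpt}",
        f"{sampler}.class_path=src.diffusion.FlowDPMSolverSampler",
        f"{sampler}.init_args.num_steps={num_steps}",
        f"{sampler}.init_args.guidance={s['guidance']}",
        f"{sampler}.init_args.timeshift={s['timeshift']}",
        f"{sampler}.init_args.guidance_interval_min={s['interval'][0]}",
        f"{sampler}.init_args.guidance_interval_max={s['interval'][1]}",
        "--per_run_seed=false", f"--seed_everything={s['seed']}",
    ]
    return shlex.join(args)


if __name__ == "__main__":
    # Print one command per checkpoint; run them from the c2i/ directory.
    for name in CHECKPOINTS:
        print(build_predict_command(name))
```

Keeping the settings in one dictionary avoids mismatches between the table and hand-edited command lines when sweeping over several checkpoints.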

Model Architecture

| Component | Value |
|---|---|
| Parameters | 797M |
| Input channels | 3 (RGB) |
| Patch size | 16 |
| Hidden size | 1152 |
| Attention heads | 16 |
| Patch-level depth | 26 |
| Pixel-level depth | 4 |
| Pixel hidden size | 16 |
| Classes | 1000 (ImageNet) |
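
The architecture values imply some simple sequence-length arithmetic, sketched below. The helper `token_stats` is hypothetical, and the calculation assumes non-overlapping square patches, no extra tokens (for example class or register tokens), and a hidden size split evenly across attention heads. None of these assumptions is confirmed by this card.

```python
def token_stats(resolution, patch_size=16, hidden_size=1152, num_heads=16, channels=3):
    """Derive sequence-length figures from the architecture table (illustrative)."""
    patches_per_side = resolution // patch_size
    return {
        # Number of tokens seen by the patch-level DiT.
        "patch_tokens": patches_per_side ** 2,
        # Number of pixels inside each patch.
        "pixels_per_patch": patch_size ** 2,
        # Raw RGB values covered by one patch.
        "values_per_patch": patch_size ** 2 * channels,
        # Per-head width if the hidden size is split evenly across heads.
        "head_dim": hidden_size // num_heads,
    }


print(token_stats(256))
# {'patch_tokens': 256, 'pixels_per_patch': 256, 'values_per_patch': 768, 'head_dim': 72}
print(token_stats(512)["patch_tokens"])  # 1024
```

Under these assumptions, moving from 256x256 to 512x512 quadruples the number of patch tokens while the number of pixels per patch stays fixed at 256.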

Citation

@misc{yu2025pixeldit,
      title={PixelDiT: Pixel Diffusion Transformers for Image Generation},
      author={Yongsheng Yu and Wei Xiong and Weili Nie and Yichen Sheng and Shiqiu Liu and Jiebo Luo},
      year={2025},
      eprint={2511.20645},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.20645},
}

License

This model is released under the NVIDIA OneWay Non-Commercial License. The work and any derivative works may only be used for non-commercial (research or evaluation) purposes.