license: other
license_name: nscl-v1
license_link: LICENSE
tags:
- image-generation
- class-conditional
- diffusion
- pixel-space
- dit
- imagenet
library_name: pytorch
pipeline_tag: unconditional-image-generation
PixelDiT: Pixel Diffusion Transformers for Image Generation
Yongsheng Yu1,2 Wei Xiong1† Weili Nie1 Yichen Sheng1 Shiqiu Liu1 Jiebo Luo2
1NVIDIA 2University of Rochester
†Project Lead and Main Advising
Model Overview
PixelDiT-XL (797M parameters) is a class-conditional image generation model trained on ImageNet. It operates directly in pixel space, with no VAE and no latent space. Its dual-level architecture combines a patch-level DiT, which models global semantics, with a pixel-level DiT, which refines fine texture details.
Pre-trained Checkpoints
| Checkpoint | Resolution | Epochs | gFID | CFG Scale | Time Shift | CFG Interval |
|---|---|---|---|---|---|---|
| imagenet256_pixeldit_xl_epoch80.ckpt | 256x256 | 80 | 2.36 | 3.25 | 1.0 | [0.1, 1.0] |
| imagenet256_pixeldit_xl_epoch160.ckpt | 256x256 | 160 | 1.97 | 3.25 | 1.0 | [0.1, 1.0] |
| imagenet256_pixeldit_xl_epoch320.ckpt | 256x256 | 320 | 1.61 | 2.75 | 1.0 | [0.1, 0.9] |
| imagenet512_pixeldit_xl.ckpt | 512x512 | 850 | 1.78 | 3.5 | 2.0 | [0.1, 1.0] |
All evaluations use FlowDPMSolver with 100 sampling steps and 50K generated samples. Metrics follow the ADM evaluation protocol.
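The CFG Scale and CFG Interval columns can be read as interval-limited classifier-free guidance: guidance is applied only while the sampling time lies inside the interval. The sketch below is a minimal illustration of that idea in plain Python, not the repository's sampler. The function name, the time convention (`t` in [0, 1]), and the fallback to the conditional prediction outside the interval are all assumptions.

```python
def guided_prediction(cond, uncond, t, scale, interval=(0.1, 1.0)):
    """Interval-limited classifier-free guidance (illustrative only).

    cond / uncond: model outputs with and without the class label,
    given here as flat lists of floats for simplicity.
    t: sampling time, assumed to lie in [0, 1].
    """
    lo, hi = interval
    if lo <= t <= hi:
        # Inside the interval: extrapolate away from the unconditional output.
        return [u + scale * (c - u) for c, u in zip(cond, uncond)]
    # Outside the interval (assumption): use the conditional output unchanged.
    return list(cond)


# Settings of the epoch-320 checkpoint: scale 2.75, interval [0.1, 0.9].
inside = guided_prediction([1.0], [0.0], t=0.5, scale=2.75, interval=(0.1, 0.9))
outside = guided_prediction([1.0], [0.0], t=0.95, scale=2.75, interval=(0.1, 0.9))
```

With an interval of [0.1, 1.0], guidance would be skipped only for `t` below 0.1 under this convention.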
Usage
Installation
pip install torch torchvision lightning omegaconf timm wandb h5py
Evaluation (Generate 50K Samples)
cd c2i/
# ImageNet 256x256 (epoch 320, best FID)
torchrun --nproc_per_node=8 main.py predict \
-c configs/pix256_xl.yaml \
--ckpt_path=imagenet256_pixeldit_xl_epoch320.ckpt \
--model.diffusion_sampler.class_path=src.diffusion.FlowDPMSolverSampler \
--model.diffusion_sampler.init_args.num_steps=100 \
--model.diffusion_sampler.init_args.guidance=2.75 \
--model.diffusion_sampler.init_args.timeshift=1.0 \
--model.diffusion_sampler.init_args.guidance_interval_min=0.1 \
--model.diffusion_sampler.init_args.guidance_interval_max=0.9 \
--per_run_seed=false --seed_everything=1000
# ImageNet 512x512
torchrun --nproc_per_node=8 main.py predict \
-c configs/pix512_xl.yaml \
--ckpt_path=imagenet512_pixeldit_xl.ckpt \
--model.diffusion_sampler.class_path=src.diffusion.FlowDPMSolverSampler \
--model.diffusion_sampler.init_args.num_steps=100 \
--model.diffusion_sampler.init_args.guidance=3.5 \
--model.diffusion_sampler.init_args.timeshift=2.0 \
--model.diffusion_sampler.init_args.guidance_interval_min=0.1 \
--model.diffusion_sampler.init_args.guidance_interval_max=1.0 \
--per_run_seed=false --seed_everything=10000
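The two commands differ only in the config file, checkpoint, and sampler settings, so the per-checkpoint values from the table can be kept in one place. The helper below is a hypothetical convenience, not part of the repository: `SETTINGS` and `predict_args` are illustrative names, and reusing `configs/pix256_xl.yaml` for the epoch-80 and epoch-160 checkpoints is an assumption. The seed flags are omitted and would need to be appended as in the commands above.

```python
# Sampler settings copied from the checkpoint table.
SETTINGS = {
    "imagenet256_pixeldit_xl_epoch80.ckpt": dict(
        config="configs/pix256_xl.yaml",  # assumed: same config as epoch 320
        guidance=3.25, timeshift=1.0, interval=(0.1, 1.0)),
    "imagenet256_pixeldit_xl_epoch160.ckpt": dict(
        config="configs/pix256_xl.yaml",  # assumed: same config as epoch 320
        guidance=3.25, timeshift=1.0, interval=(0.1, 1.0)),
    "imagenet256_pixeldit_xl_epoch320.ckpt": dict(
        config="configs/pix256_xl.yaml",
        guidance=2.75, timeshift=1.0, interval=(0.1, 0.9)),
    "imagenet512_pixeldit_xl.ckpt": dict(
        config="configs/pix512_xl.yaml",
        guidance=3.5, timeshift=2.0, interval=(0.1, 1.0)),
}


def predict_args(ckpt, num_steps=100):
    """Build the argument list that follows `torchrun --nproc_per_node=8`."""
    s = SETTINGS[ckpt]
    prefix = "--model.diffusion_sampler"
    lo, hi = s["interval"]
    return [
        "main.py", "predict",
        "-c", s["config"],
        f"--ckpt_path={ckpt}",
        f"{prefix}.class_path=src.diffusion.FlowDPMSolverSampler",
        f"{prefix}.init_args.num_steps={num_steps}",
        f"{prefix}.init_args.guidance={s['guidance']}",
        f"{prefix}.init_args.timeshift={s['timeshift']}",
        f"{prefix}.init_args.guidance_interval_min={lo}",
        f"{prefix}.init_args.guidance_interval_max={hi}",
    ]


if __name__ == "__main__":
    # Print a shell-ready argument string for the 512x512 checkpoint.
    print(" ".join(predict_args("imagenet512_pixeldit_xl.ckpt")))
```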
After generating samples, compute FID with the ADM evaluation toolkit.
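The ADM evaluation script reads a sample batch from an `.npz` file whose `arr_0` entry holds `uint8` images of shape `(N, H, W, 3)`. If the generated samples are not already in that form, a packing step along these lines may help. This is a sketch under assumptions: `pack_samples` is a hypothetical helper, and how the `predict` command stores its outputs is not stated here, so loading the images into arrays is left to the caller.

```python
import numpy as np


def pack_samples(images, out_path):
    """Stack RGB uint8 images into one array and save it as `arr_0`.

    images: iterable of arrays, each of shape (H, W, 3) and dtype uint8.
    out_path: destination file, e.g. "samples.npz".
    Returns the shape of the saved batch.
    """
    batch = np.stack([np.asarray(img) for img in images])
    if batch.dtype != np.uint8 or batch.ndim != 4 or batch.shape[-1] != 3:
        raise ValueError(
            f"expected uint8 images of shape (H, W, 3), got {batch.dtype} {batch.shape}"
        )
    # The keyword name becomes the key inside the .npz archive.
    np.savez(out_path, arr_0=batch)
    return batch.shape
```

The resulting file can then be passed to the ADM evaluator together with the matching ImageNet reference batch.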
Model Architecture
| Component | Value |
|---|---|
| Parameters | 797M |
| Input channels | 3 (RGB) |
| Patch size | 16 |
| Hidden size | 1152 |
| Attention heads | 16 |
| Patch-level depth | 26 |
| Pixel-level depth | 4 |
| Pixel hidden size | 16 |
| Classes | 1000 (ImageNet) |
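A few quantities follow directly from the table by arithmetic. The sketch below derives them for both supported resolutions; `derived_shapes` is an illustrative name, and the code assumes only that the patch-level DiT sees one token per non-overlapping 16x16 patch. How the pixel-level DiT organizes its tokens is not specified here and is not modeled.

```python
def derived_shapes(resolution, patch_size=16, hidden=1152, heads=16, channels=3):
    """Derive token counts and per-head width from the architecture table."""
    grid = resolution // patch_size  # patches per image side
    return {
        "patch_tokens": grid * grid,                          # patch-level sequence length
        "pixels_per_patch": patch_size * patch_size,          # pixels covered by one patch
        "values_per_patch": patch_size * patch_size * channels,  # raw RGB values per patch
        "head_dim": hidden // heads,                          # width of each attention head
    }


for res in (256, 512):
    print(res, derived_shapes(res))
```

At 256x256 this gives 256 patch tokens, and at 512x512 it gives 1024. Each patch covers 256 pixels (768 RGB values), and each attention head is 72 dimensions wide.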
Citation
@misc{yu2025pixeldit,
title={PixelDiT: Pixel Diffusion Transformers for Image Generation},
author={Yongsheng Yu and Wei Xiong and Weili Nie and Yichen Sheng and Shiqiu Liu and Jiebo Luo},
year={2025},
eprint={2511.20645},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.20645},
}
License
This model is released under the NVIDIA OneWay Non-Commercial License. The work and any derivative works may only be used for non-commercial (research or evaluation) purposes.