---
license: other
license_name: nscl-v1
license_link: LICENSE
tags:
- image-generation
- class-conditional
- diffusion
- pixel-space
- dit
- imagenet
library_name: pytorch
pipeline_tag: unconditional-image-generation
---
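# PixelDiT: Pixel Diffusion Transformers for Image Generation

Yongsheng Yu<sup>1,2</sup>, Wei Xiong<sup>1†</sup>, Weili Nie<sup>1</sup>, Yichen Sheng<sup>1</sup>, Shiqiu Liu<sup>1</sup>, Jiebo Luo<sup>2</sup>

<sup>1</sup>NVIDIA, <sup>2</sup>University of Rochester

<sup>†</sup>Project Lead and Main Advising
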
## Model Overview
**PixelDiT-XL** (797M parameters) is a class-conditional image generation model trained on ImageNet. It operates directly in **pixel space**, with no VAE and no latent space. Its dual-level architecture combines a patch-level DiT for global semantics with a pixel-level DiT for fine texture details.
## Pre-trained Checkpoints
| Checkpoint | Resolution | Epochs | gFID | CFG Scale | Time Shift | CFG Interval |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| `imagenet256_pixeldit_xl_epoch80.ckpt` | 256x256 | 80 | **2.36** | 3.25 | 1.0 | [0.1, 1.0] |
| `imagenet256_pixeldit_xl_epoch160.ckpt` | 256x256 | 160 | **1.97** | 3.25 | 1.0 | [0.1, 1.0] |
| `imagenet256_pixeldit_xl_epoch320.ckpt` | 256x256 | 320 | **1.61** | 2.75 | 1.0 | [0.1, 0.9] |
| `imagenet512_pixeldit_xl.ckpt` | 512x512 | 850 | **1.78** | 3.5 | 2.0 | [0.1, 1.0] |
All evaluations use **FlowDPMSolver** with **100 steps** and 50K generated samples per checkpoint. gFID (generation FID, lower is better) and the other metrics follow the ADM evaluation protocol.
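
The CFG Scale and CFG Interval columns can be read together: classifier-free guidance is applied only while the sampling time lies inside the interval, and the plain conditional prediction is used outside it. The following is a minimal sketch of that gating, not the PixelDiT sampler itself. The function name `guided_prediction`, the time convention (`t` in [0, 1]), and the inclusive bounds are assumptions made for illustration.

```python
def guided_prediction(cond, uncond, t, scale, interval):
    """Blend conditional and unconditional predictions with interval-gated CFG.

    cond, uncond: model outputs as lists of floats (stand-ins for tensors).
    t: current sampling time in [0, 1] (convention assumed, not from the repo).
    scale: CFG scale, e.g. 2.75 for the epoch-320 checkpoint.
    interval: (t_min, t_max), e.g. (0.1, 0.9).
    """
    t_min, t_max = interval
    # Outside the interval, use the conditional prediction only (scale 1).
    w = scale if t_min <= t <= t_max else 1.0
    return [u + w * (c - u) for c, u in zip(cond, uncond)]


cond, uncond = [1.0, 2.0], [0.0, 1.0]
inside = guided_prediction(cond, uncond, t=0.5, scale=2.75, interval=(0.1, 0.9))
outside = guided_prediction(cond, uncond, t=0.95, scale=2.75, interval=(0.1, 0.9))
print(inside)   # guidance active
print(outside)  # guidance skipped, equals cond
```

With an interval of [0.1, 1.0], as used for three of the four checkpoints, only times below 0.1 would skip guidance under this reading.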
## Usage
### Installation
```bash
pip install torch torchvision lightning omegaconf timm wandb h5py
```
### Evaluation (Generate 50K Samples)
```bash
cd c2i/

# ImageNet 256x256 (epoch 320, best FID)
torchrun --nproc_per_node=8 main.py predict \
    -c configs/pix256_xl.yaml \
    --ckpt_path=imagenet256_pixeldit_xl_epoch320.ckpt \
    --model.diffusion_sampler.class_path=src.diffusion.FlowDPMSolverSampler \
    --model.diffusion_sampler.init_args.num_steps=100 \
    --model.diffusion_sampler.init_args.guidance=2.75 \
    --model.diffusion_sampler.init_args.timeshift=1.0 \
    --model.diffusion_sampler.init_args.guidance_interval_min=0.1 \
    --model.diffusion_sampler.init_args.guidance_interval_max=0.9 \
    --per_run_seed=false --seed_everything=1000

# ImageNet 512x512
torchrun --nproc_per_node=8 main.py predict \
    -c configs/pix512_xl.yaml \
    --ckpt_path=imagenet512_pixeldit_xl.ckpt \
    --model.diffusion_sampler.class_path=src.diffusion.FlowDPMSolverSampler \
    --model.diffusion_sampler.init_args.num_steps=100 \
    --model.diffusion_sampler.init_args.guidance=3.5 \
    --model.diffusion_sampler.init_args.timeshift=2.0 \
    --model.diffusion_sampler.init_args.guidance_interval_min=0.1 \
    --model.diffusion_sampler.init_args.guidance_interval_max=1.0 \
    --per_run_seed=false --seed_everything=10000
```
After generating samples, compute FID with the [ADM evaluation toolkit](https://github.com/openai/guided-diffusion/tree/main/evaluations).
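
The ADM evaluator reads each batch from an `.npz` file containing one `uint8` array of shape `(N, H, W, 3)` stored under the key `arr_0`. If your samples are not already in that form, something like the sketch below can pack them. The helper name `pack_for_adm` is hypothetical, the demo uses random data instead of real samples, and how the `predict` step stores its outputs is not covered here.

```python
import os
import tempfile

import numpy as np


def pack_for_adm(images, out_path):
    """Stack HxWx3 uint8 images into one array and save it as an .npz file."""
    batch = np.stack([np.asarray(img) for img in images])
    if batch.ndim != 4 or batch.shape[-1] != 3:
        raise ValueError(f"expected (N, H, W, 3), got {batch.shape}")
    if batch.dtype != np.uint8:
        raise ValueError(f"expected uint8 pixels, got {batch.dtype}")
    # Arrays passed positionally to np.savez are stored as "arr_0", "arr_1", ...
    np.savez(out_path, batch)
    return batch.shape


# Demo with random pixels standing in for generated samples.
rng = np.random.default_rng(0)
fake = [rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8) for _ in range(4)]
out_file = os.path.join(tempfile.mkdtemp(), "samples.npz")
shape = pack_for_adm(fake, out_file)
loaded = np.load(out_file)["arr_0"]
print(shape, loaded.dtype)
```

The resulting file can then be passed to the toolkit's `evaluator.py` together with the matching ImageNet reference batch.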
## Model Architecture
| Component | Value |
|-----------|-------|
| Parameters | 797M |
| Input channels | 3 (RGB) |
| Patch size | 16 |
| Hidden size | 1152 |
| Attention heads | 16 |
| Patch-level depth | 26 |
| Pixel-level depth | 4 |
| Pixel hidden size | 16 |
| Classes | 1000 (ImageNet) |
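
To make the table concrete, the sketch below works out the token counts it implies at each resolution. It assumes non-overlapping square patches and the usual split of the hidden size evenly across attention heads; neither is stated explicitly above, and `token_layout` is an illustrative helper, not part of the code base.

```python
def token_layout(resolution, patch_size=16, hidden_size=1152, num_heads=16):
    """Derive sequence sizes from the architecture table (assumed conventions)."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be divisible by patch_size")
    per_side = resolution // patch_size
    return {
        "patch_tokens": per_side ** 2,             # patch-level sequence length
        "pixels_per_patch": patch_size ** 2,       # pixels inside one patch
        "values_per_patch": patch_size ** 2 * 3,   # RGB values inside one patch
        "head_dim": hidden_size // num_heads,      # per-head width, if split evenly
    }


layout_256 = token_layout(256)
layout_512 = token_layout(512)
print(layout_256)
print(layout_512)
```

Under these assumptions the patch-level DiT sees 256 tokens at 256x256 and 1024 tokens at 512x512, while each patch always covers 256 pixels.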
## Citation
```bibtex
@misc{yu2025pixeldit,
title={PixelDiT: Pixel Diffusion Transformers for Image Generation},
author={Yongsheng Yu and Wei Xiong and Weili Nie and Yichen Sheng and Shiqiu Liu and Jiebo Luo},
year={2025},
eprint={2511.20645},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.20645},
}
```
## License
This model is released under the [NVIDIA OneWay Non-Commercial License](LICENSE). The work and any derivative works may only be used for non-commercial (research or evaluation) purposes.