---
license: other
license_name: nscl-v1
license_link: LICENSE
tags:
- image-generation
- class-conditional
- diffusion
- pixel-space
- dit
- imagenet
library_name: pytorch
pipeline_tag: unconditional-image-generation
---

# PixelDiT: Pixel Diffusion Transformers for Image Generation

Yongsheng Yu¹,² Wei Xiong¹† Weili Nie¹ Yichen Sheng¹ Shiqiu Liu¹ Jiebo Luo²

¹NVIDIA ²University of Rochester
†Project Lead and Main Advising

## Model Overview

**PixelDiT-XL** (797M parameters) is a class-conditional image generation model trained on ImageNet. It operates directly in **pixel space**, with no VAE and no latent space. It uses a dual-level architecture that combines a patch-level DiT for global semantics with a pixel-level DiT for fine texture details.

## Pre-trained Checkpoints

| Checkpoint | Resolution | Epochs | gFID | CFG Scale | Time Shift | CFG Interval |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| `imagenet256_pixeldit_xl_epoch80.ckpt` | 256x256 | 80 | **2.36** | 3.25 | 1.0 | [0.1, 1.0] |
| `imagenet256_pixeldit_xl_epoch160.ckpt` | 256x256 | 160 | **1.97** | 3.25 | 1.0 | [0.1, 1.0] |
| `imagenet256_pixeldit_xl_epoch320.ckpt` | 256x256 | 320 | **1.61** | 2.75 | 1.0 | [0.1, 0.9] |
| `imagenet512_pixeldit_xl.ckpt` | 512x512 | 850 | **1.78** | 3.5 | 2.0 | [0.1, 1.0] |

All evaluations use the **FlowDPMSolver** sampler with **100 steps** and 50K generated samples. Metrics follow the ADM evaluation protocol. The CFG Scale, Time Shift, and CFG Interval columns correspond to the `guidance`, `timeshift`, and `guidance_interval_min` / `guidance_interval_max` sampler arguments in the evaluation commands.

## Usage

### Installation

```bash
pip install torch torchvision lightning omegaconf timm wandb h5py
```

### Evaluation (Generate 50K Samples)

The commands below launch 8 processes per node via `torchrun` and use the sampler settings listed for each checkpoint in the table above.

```bash
cd c2i/

# ImageNet 256x256 (epoch 320, best FID)
torchrun --nproc_per_node=8 main.py predict \
    -c configs/pix256_xl.yaml \
    --ckpt_path=imagenet256_pixeldit_xl_epoch320.ckpt \
    --model.diffusion_sampler.class_path=src.diffusion.FlowDPMSolverSampler \
    --model.diffusion_sampler.init_args.num_steps=100 \
    --model.diffusion_sampler.init_args.guidance=2.75 \
    --model.diffusion_sampler.init_args.timeshift=1.0 \
    --model.diffusion_sampler.init_args.guidance_interval_min=0.1 \
    --model.diffusion_sampler.init_args.guidance_interval_max=0.9 \
    --per_run_seed=false --seed_everything=1000

# ImageNet 512x512
torchrun --nproc_per_node=8 main.py predict \
    -c configs/pix512_xl.yaml \
    --ckpt_path=imagenet512_pixeldit_xl.ckpt \
    --model.diffusion_sampler.class_path=src.diffusion.FlowDPMSolverSampler \
    --model.diffusion_sampler.init_args.num_steps=100 \
    --model.diffusion_sampler.init_args.guidance=3.5 \
    --model.diffusion_sampler.init_args.timeshift=2.0 \
    --model.diffusion_sampler.init_args.guidance_interval_min=0.1 \
    --model.diffusion_sampler.init_args.guidance_interval_max=1.0 \
    --per_run_seed=false --seed_everything=10000
```

After generating samples, compute FID with the [ADM evaluation toolkit](https://github.com/openai/guided-diffusion/tree/main/evaluations).

## Model Architecture

| Component | Value |
|-----------|-------|
| Parameters | 797M |
| Input channels | 3 (RGB) |
| Patch size | 16 |
| Hidden size | 1152 |
| Attention heads | 16 |
| Patch-level depth | 26 |
| Pixel-level depth | 4 |
| Pixel hidden size | 16 |
| Classes | 1000 (ImageNet) |

## Citation

```bibtex
@misc{yu2025pixeldit,
  title={PixelDiT: Pixel Diffusion Transformers for Image Generation},
  author={Yongsheng Yu and Wei Xiong and Weili Nie and Yichen Sheng and Shiqiu Liu and Jiebo Luo},
  year={2025},
  eprint={2511.20645},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.20645},
}
```

## License

This model is released under the [NVIDIA OneWay Non-Commercial License](LICENSE). The work and any derivative works may be used only for non-commercial (research or evaluation) purposes.
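
## Appendix: Patch Grid Arithmetic

The patch size and channel count in the Model Architecture section imply specific patch-grid sizes at the two released resolutions. The short sketch below works them out. It assumes the image is split into non-overlapping square patches, as in a standard DiT/ViT patchify step; the card itself does not spell this out, and `patch_grid` is a hypothetical helper written for illustration, not a function from the PixelDiT codebase.

```python
# Values copied from the Model Architecture table.
PATCH_SIZE = 16
IN_CHANNELS = 3  # RGB


def patch_grid(resolution: int, patch_size: int = PATCH_SIZE) -> dict:
    """Patch-grid arithmetic for a square image.

    Assumption (not stated in the model card): patches are square,
    non-overlapping, and tile the image exactly.
    """
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of patch_size")
    side = resolution // patch_size
    pixels = patch_size * patch_size
    return {
        "patches_per_side": side,
        "num_patches": side * side,
        "pixels_per_patch": pixels,
        "values_per_patch": pixels * IN_CHANNELS,
    }


if __name__ == "__main__":
    for res in (256, 512):
        print(res, patch_grid(res))
```

Under this assumption, a 256x256 image yields a 16x16 grid (256 patches) and a 512x512 image yields a 32x32 grid (1024 patches); each patch covers 256 pixels, or 768 RGB values.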