i1: A Simple and Fully Open Recipe for Strong Text-to-Image Models
Quick Answer
This paper shows that i1, a 3B-parameter text-to-image diffusion model, outperforms the best existing fully open model by 29.5 absolute percentage points on average across five benchmarks.
Quick Take
i1 is a 3B-parameter text-to-image diffusion model trained using only publicly available datasets. Its recipe draws on 300+ controlled experiments, and on average across five benchmarks it outperforms the best existing fully open model by 29.5 absolute percentage points. The authors release the checkpoints, the training and inference code, and the data processing pipeline as an open foundation for future research in text-to-image generation.
Key Points
- The recipe draws on 300+ controlled experiments totaling 700K+ TPU v6e hours; i1 itself is trained using only publicly available datasets.
- Equal weighting is a strong default for mixing curated datasets.
- Larger text encoder adapters improve performance with minimal added parameters.
- i1 is competitive with leading models on GenEval, DPG, PRISM, CVTG-2K, and LongText.
- Code, checkpoints, and data processing pipeline are publicly available.
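The equal-weighting point above can be illustrated with a minimal sketch. It is not taken from the i1 codebase: the dataset names, sizes, and the helper functions mixture_weights and sample_sources are hypothetical, and the proportional mode is included only for contrast. The idea shown is that each curated dataset is drawn with probability 1/K regardless of its size.

```python
import random
from collections import Counter


def mixture_weights(dataset_sizes, mode="equal"):
    """Return per-dataset sampling probabilities (hypothetical helper).

    mode="equal": every dataset gets probability 1/K, ignoring its size.
    mode="proportional": probability proportional to size, for contrast.
    """
    names = list(dataset_sizes)
    if mode == "equal":
        return {name: 1.0 / len(names) for name in names}
    if mode == "proportional":
        total = sum(dataset_sizes.values())
        return {name: dataset_sizes[name] / total for name in names}
    raise ValueError(f"unknown mode: {mode!r}")


def sample_sources(weights, num_samples, seed=0):
    """Pick which dataset each training example is drawn from."""
    rng = random.Random(seed)
    names = list(weights)
    probs = [weights[name] for name in names]
    return rng.choices(names, weights=probs, k=num_samples)


# Hypothetical curated datasets with very different sizes.
sizes = {"curated_a": 1_000_000, "curated_b": 250_000, "curated_c": 50_000}

equal = mixture_weights(sizes, mode="equal")
proportional = mixture_weights(sizes, mode="proportional")

# Under equal weighting, each source appears roughly equally often.
draws = Counter(sample_sources(equal, num_samples=9_000))
```

Under equal weighting the smallest dataset is revisited far more often per example than the largest one, which is the trade-off this default accepts.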
Article Content
From source RSS / original summary: arXiv:2606.11289v1
Abstract: Diffusion models have consistently driven progress in text-to-image generation. However, it is challenging to attribute recent progress to specific modeling and data choices: state-of-the-art open-weight models provide limited ablations, and do not disclose their training data and full training details.
The research community needs fully open (weights, data, and code) models as a foundation for further research; yet existing fully open models still fall significantly short of leading models in performance. In this project, we conduct a systematic investigation of the modeling and data design choices in text-to-image diffusion training and inference with 300+ controlled experiments totaling 700K+ TPU v6e hours. Our experiments highlight several empirical findings (e.g., equal weighting is a strong default for mixing curated datasets) and simple design decisions (e.g., larger text encoder adapters improve performance with minimal added parameters) for training strong models. Guided by these insights, we train i1, a 3B-parameter text-to-image diffusion model using only publicly available datasets. i1 is competitive with leading models on five representative benchmarks (GenEval, DPG, PRISM, CVTG-2K, and LongText), and outperforms the best existing fully open model by 29.5 absolute percentage points on average. We provide the i1 checkpoints, training and inference code, and the data processing pipeline. Together, our findings and the i1 recipe establish a practical foundation for future open research in text-to-image diffusion models. Our code is available at https://github.com/zlab-princeton/i1.