JuZhou 1.0 Technical Report: The First Edge-Native Text-to-Image Foundation Model Trained Entirely on China-Developed AI Accelerators
Quick Answer
JuZhou 1.0 is an ultra-lightweight text-to-image model of about 0.387B parameters, trained and distilled entirely on China-developed Sugon K100 AI accelerators. Its 28-step base model reaches an overall GenEval score of 0.69.
Quick Take
With roughly 0.387B parameters, JuZhou 1.0 is designed for fully offline, on-device text-to-image generation on mobile devices. Its 28-step base model scores 0.69 on GenEval, above the published results for SDXL (2.6B, 0.55), SD3-Medium (2B, 0.62), and IF-XL (4.3B, 0.61). After distillation to 4 sampling steps, the U-Net denoising branch runs in approximately 1.6 seconds on a Snapdragon 8 Elite Gen 5 smartphone.
Key Points
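- JuZhou 1.0 totals approximately 0.387B parameters: a 0.385B-parameter denoising U-Net plus a 1.90M-parameter distilled decoder.
- It combines Rectified Flow training with DMD2 distillation, reducing inference to 4 sampling steps.
- Chinese semantic alignment, trained on 9M curated image-text pairs, allows direct Chinese prompting without external translation at inference time.
- On a smartphone with the Snapdragon 8 Elite Gen 5, the 4-step U-Net denoising branch runs in approximately 1.6 seconds; the full Android poetry-to-image pipeline takes 4.5 seconds on a Xiaomi 17 Pro Max.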
- JuZhou 1.0 offers a practical solution for offline mobile text-to-image applications.
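The headline parameter count and the baseline comparison can be checked with simple arithmetic. The following is a minimal Python sketch that uses only the figures quoted in the paper's abstract; the variable names are illustrative and do not come from the paper.

```python
# Component sizes as quoted in the abstract (parameter counts).
UNET_PARAMS = 385_000_000     # 0.385B-parameter denoising U-Net
DECODER_PARAMS = 1_900_000    # 1.90M-parameter distilled decoder

total_params = UNET_PARAMS + DECODER_PARAMS
total_in_billions = round(total_params / 1e9, 3)

# Overall GenEval score reported for the 28-step base model.
JUZHOU_GENEVAL = 0.69

# Published baselines cited in the abstract: (parameters, GenEval score).
BASELINES = {
    "SDXL": (2.6e9, 0.55),
    "SD3-Medium": (2.0e9, 0.62),
    "IF-XL": (4.3e9, 0.61),
}

# For each baseline: how many times larger it is, and the GenEval gap.
comparison = {
    name: {
        "size_ratio": params / total_params,
        "score_gap": round(JUZHOU_GENEVAL - score, 2),
    }
    for name, (params, score) in BASELINES.items()
}

print(f"Total: {total_params:,} parameters (~{total_in_billions}B)")
for name, row in comparison.items():
    print(f"{name}: {row['size_ratio']:.1f}x larger, "
          f"GenEval gap {row['score_gap']:+.2f}")
```

Note that the 0.69 score belongs to the 28-step base model; the abstract does not report a GenEval score for the 4-step distilled variant, so this comparison says nothing about it.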
Article Content
Source: arXiv:2606.28421v1 (announce type: new)
Abstract: Text-to-image (T2I) diffusion models typically require substantial computational resources and cloud infrastructure, posing significant challenges for edge deployment in terms of latency, cost, and user privacy. We present JuZhou 1.0, an ultra-lightweight T2I foundation model designed for fully offline, on-device execution. JuZhou 1.0 achieves its efficiency through four key designs: (1) a compact image-generation backbone consisting of a 0.385B-parameter denoising U-Net and a 1.90M-parameter distilled decoder, totaling approximately 0.387B parameters; (2) Rectified Flow training combined with DMD2 distillation, reducing inference to 4 sampling steps; (3) Chinese semantic alignment trained on 9M curated image-text pairs, enabling direct Chinese prompting without external translation at inference time; and (4) a training and distillation pipeline completed on domestically developed Sugon K100 AI accelerators without relying on NVIDIA GPUs for training or distillation. Despite its compact scale, the 28-step base model of JuZhou 1.0 achieves an overall GenEval score of 0.69, outperforming published baselines including SDXL (2.6B, 0.55), SD3-Medium (2B, 0.62), and IF-XL (4.3B, 0.61). We further validate the full poetry-to-image pipeline on Android and the core CLIP-U-Net-VAE generation branch on iOS. On a smartphone powered by the Snapdragon 8 Elite Gen 5 Mobile Platform, the 4-step U-Net denoising branch runs in approximately 1.6 seconds, while the full Android poetry-to-image pipeline takes 4.5 seconds with on-device prompt refinement on Xiaomi 17 Pro Max. These results position JuZhou 1.0 as a practical approach to mobile text-to-image generation and provide a concrete reference for Chinese-native generation, domestic-compute training, and fully offline on-device deployment after one-time installation.
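To illustrate why a Rectified Flow model can be sampled in very few steps, here is a minimal Python sketch of fixed-step Euler sampling. It is not the paper's implementation: the toy velocity field stands in for the text-conditioned U-Net, the convention that t = 0 is noise and t = 1 is the image is an assumption, and the names `euler_sample` and `make_toy_velocity` are hypothetical. DMD2 distillation is not shown.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(
    velocity: Callable[[Vector, float], Vector],
    x_start: Vector,
    num_steps: int = 4,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x_start)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Hypothetical stand-in for a learned network.

    It returns the velocity of the straight line from the current point
    to `target`, which is the kind of path Rectified Flow training aims for.
    """
    def velocity(x: Vector, t: float) -> Vector:
        # Remaining displacement divided by remaining time (t < 1 here).
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


target = [1.0, -2.0, 0.5]   # plays the role of the clean sample
noise = [0.3, 0.7, -1.1]    # plays the role of the initial noise
sample = euler_sample(make_toy_velocity(target), noise, num_steps=4)
print(sample)
```

Because the toy trajectory is a straight line, the velocity is constant along it and Euler integration lands on the target regardless of the step count. A learned velocity field is only approximately straight, which is one reason few-step samplers are typically paired with a distillation stage, as the abstract describes.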


