JuZhou 1.0 Technical Report: The First Edge-Native Text-to-Image Foundation Model Trained Entirely on China-Developed AI Accelerators

arXiv cs.CV·Ce Chen, Congrui Wang, Yonglin Li, Zhenchen Wan, Mingyang Geng, Junhao Xiao, Zhengpeng Xing, Yaqing Hu, Yao Wu, Zhaoyang Qu, Long Lan, Xinwang Liu, Yingqi Peng, Shijia Li, Zufeng Zhang, Chen Ma, Jingjing Zhou, Xingyu Wang, Qilin Lu, Bin Jiang, Qilin Sun, Shanzhi Gu, Yaoguang Jin, Tongliang Liu, Kede Ma, Yifan Peng

1d ago

·~2 min·6/30/2026·en·0

Quick Answer

This paper shows that JuZhou 1.0 is an ultra-lightweight text-to-image model, trained entirely on Chinese AI accelerators, achieving a GenEval score of 0.69 with only 0.387B parameters.

Quick Take

JuZhou 1.0 is an ultra-lightweight text-to-image model, trained entirely on Chinese AI accelerators, achieving a GenEval score of 0.69 with only 0.387B parameters. It enables efficient on-device execution for mobile applications, outperforming larger models like SDXL and IF-XL while maintaining low latency and cost.

Key Points

JuZhou 1.0 features a 0.387B parameter architecture for efficient text-to-image generation.
It uses Rectified Flow training, reducing inference time to just 4 sampling steps.
The model supports direct Chinese prompting, trained on 9M image-text pairs.
On Snapdragon 8 Elite Gen 5, the U-Net denoising runs in 1.6 seconds.
JuZhou 1.0 offers a practical solution for offline mobile text-to-image applications.

Paper Resources

Read Paperarxiv.org View PDFarxiv.org

Article Content

From source RSS / original summary

arXiv:2606. 28421v1 Announce Type: new Abstract: Text-to-image (T2I) diffusion models typically require substantial computational resources and cloud infrastructure, posing significant challenges for edge deployment in terms of latency, cost, and user privacy. We present JuZhou 1. 0, an ultra-lightweight T2I foundation model designed for fully offline, on-device execution. JuZhou 1. 0 achieves its efficiency through four key designs: (1) a compact image-generation backbone consisting of a 0.

385B-parameter denoising U-Net and a 1. 90M-parameter distilled decoder, totaling approximately 0.

387B parameters; (2) Rectified Flow training combined with DMD2 distillation, reducing inference to 4 sampling steps; (3) Chinese semantic alignment trained on 9M curated image-text pairs, enabling direct Chinese prompting without external translation at inference time; and (4) a training and distillation pipeline completed on domestically developed Sugon K100 AI accelerators without relying on NVIDIA GPUs for training or distillation. Despite its compact scale, the 28-step base model of JuZhou 1.

0 achieves an overall GenEval score of 0. 69, outperforming published baselines including SDXL (2. 6B, 0. 55), SD3-Medium (2B, 0. 62), and IF-XL (4. 3B, 0. 61). We further validate the full poetry-to-image pipeline on Android and the core CLIP-U-Net-VAE generation branch on iOS. On a smartphone powered by the Snapdragon 8 Elite Gen 5 Mobile Platform, the 4-step U-Net denoising branch runs in approximately 1. 6 seconds, while the full Android poetry-to-image pipeline takes 4.

5 seconds with on-device prompt refinement on Xiaomi 17 Pro Max. These results position JuZhou 1. 0 as a practical approach to mobile text-to-image generation and provide a concrete reference for Chinese-native generation, domestic-compute training, and fully offline on-device deployment after one-time installation.

Read on arxiv.org

Want this in your inbox every morning?

Daily brief at your local 8am — bilingual EN/中文, free.

Subscribe — it's free

More from arXiv cs.CV

See more →

arXiv cs.CV·Shahrzad Esmat, Chaunte W. Lacewell, Sameh Gobriel, Nilesh Jain, Ali Jannesari

3w ago

FeaturedOriginal

LLM-Guided ANN Index Optimization for Human-Object Interaction Retrieval

AI Summary

A phase-aware LLM agent optimizes human-object interaction retrieval, outperforming Optuna TPE by 33.3% and VDTuner by 34.2% on the HICO-DET benchmark. This method enhances throughput by 15.3x over UniIR and demonstrates strong transferability across vector database management systems.

#LLM #Agent #Inference #AI Startup

JuZhou 1.0 Technical Report: The First Edge-Native Text-to-Image Foundation Model Trained Entirely on China-Developed AI Accelerators

Quick Answer

Quick Take

Key Points

Paper Resources

Article Content

Want this in your inbox every morning?

More from arXiv cs.CV

LLM-Guided ANN Index Optimization for Human-Object Interaction Retrieval

Curvature-Guided Mixing for MLLM Adaptation

Mind the Heads: Topological Representation Alignment for Multimodal LLMs

Related in this space

Deploy a Production-Ready NVIDIA AI-Q Blueprint on Oracle Cloud Infrastructure

Deploy Long-Context Reasoning and Agentic Workflows with MiniMax M3 on NVIDIA Accelerated Infrastructure

Deploy Self-Evolving Agents for Faster, More Secure Research with a Hermes Agent and NVIDIA NemoClaw