Self-Prompting Diffusion Transformer for Open-Vocabulary Scene Text Editing via In-Context Learning

arXiv cs.CV·Hongxi Li, Tong Wang, Chengjing Wu, Tianbao Liu, Jiangtao Yao, Xiaochao Qu, Xinxiao Wu, Luoqi Liu, Ting Liu


Quick Take

A self-prompting diffusion transformer enables open-vocabulary scene text editing with style consistency.

Key Points

Constructs style and glyph prompts from original images.
Utilizes a two-stage training strategy for refinement.
Achieves state-of-the-art performance in text accuracy.


[Submitted on 15 May 2026]


Abstract: Scene text editing aims to modify text in a target region of an image while preserving the surrounding background style and texture. Existing methods rely solely on image background information and neglect the visual details of the target region, which discards the stylistic features of the original text and essentially reduces the task to text rendering. Moreover, the conditions imposed by a pre-trained glyph encoder limit the scope of editable text. To address these issues, this paper proposes a self-prompting scene text editing method that constructs style and glyph prompts directly from the original image, without introducing additional style or glyph encoders. We employ a two-stage training strategy: the diffusion transformer is first trained on large-scale self-supervised data and then refined using a small set of paired images. By leveraging the in-context learning capability of the Multi-Modal Diffusion Transformer (MM-DiT), the method achieves open-vocabulary and style-consistent text editing. Experimental results on various languages demonstrate that our method achieves state-of-the-art performance in both text accuracy and style consistency. Our project page: this https URL.

Comments:	ICML 2026
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2605.15523 [cs.CV]
	(or arXiv:2605.15523v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2605.15523 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Tong Wang
[v1] Fri, 15 May 2026 01:44:17 UTC (23,017 KB)

— Originally published at arxiv.org
