Token-to-Token Alignment of Text Embeddings for Semantic Blending
Quick Answer
This paper shows that the Token-to-Token alignment framework improves semantic blending in generative models by establishing explicit semantic correspondences between tokens across text prompts.
Quick Take
The Token-to-Token alignment framework enhances semantic blending in generative models by establishing explicit semantic correspondences between tokens across text prompts. This method allows for smooth transitions and coherent edits in image generation, revealing a continuous semantic structure in text embeddings that can be leveraged without altering the generative model.
Key Points
- Introduces a framework for aligning token embeddings across prompts.
- Enables smooth transitions in image generation through linear interpolation.
- Reveals a continuous semantic structure in text-to-image models.
- Improves applications like image blending and continuous editing.
- Aligns representations rather than modifying the generative model.
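As a rough illustration of the interpolation point above, the following sketch blends two token-embedding sequences position by position. It assumes the sequences are already aligned, so that position i in each refers to the same concept. The function name `interpolate_aligned` and the toy 2-dimensional vectors are hypothetical and do not come from the paper.

```python
from typing import List

Vector = List[float]


def interpolate_aligned(tokens_a: List[Vector], tokens_b: List[Vector], t: float) -> List[Vector]:
    """Linearly blend two aligned token-embedding sequences.

    Assumption: position i in tokens_a and position i in tokens_b already
    refer to the same concept. The alignment step itself is not shown here.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    if len(tokens_a) != len(tokens_b):
        raise ValueError("aligned sequences must have the same length")
    blended = []
    for vec_a, vec_b in zip(tokens_a, tokens_b):
        if len(vec_a) != len(vec_b):
            raise ValueError("embedding dimensions must match")
        # Per-token linear interpolation: (1 - t) * a + t * b
        blended.append([(1.0 - t) * x + t * y for x, y in zip(vec_a, vec_b)])
    return blended


# Toy 2-dimensional embeddings; real text encoders use far larger vectors.
prompt_a = [[0.0, 2.0], [1.0, 1.0]]
prompt_b = [[2.0, 4.0], [3.0, 1.0]]
midpoint = interpolate_aligned(prompt_a, prompt_b, 0.5)
print(midpoint)
```

Sweeping `t` from 0 to 1 would trace the transition from the first prompt to the second. Without alignment, the same arithmetic would mix embeddings of unrelated tokens.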
Paper Resources
Abstract
In modern generative models, images are specified and controlled through text prompts. In practice, images are generated from sequences of tokens derived from these prompts. However, the space of token sequences lacks a consistent accessible structure: semantically similar images may correspond to sequences that differ in wording, ordering, and placement of concepts, while similar token sequences may encode very different semantics. This apparent lack of structure makes it difficult to perform smooth transitions in this space, hindering applications such as image blending and continuous control of edits. We argue that this limitation stems not from the absence of semantic structure, but from misalignment between representations. To address this misalignment, we introduce Token-to-Token alignment, a framework that establishes explicit semantic correspondence between tokens across prompts. Our approach transforms prompts into a structured representation in which semantically corresponding concepts are mapped to consistent positions across prompts, and then aligns their token embeddings based on semantic similarity. Concretely, the method consists of two stages: a structural alignment that rephrases prompts into a shared structured form, followed by an embedding-level alignment that matches token representations across prompts. With this alignment in place, simple linear interpolation becomes a meaningful operation, producing smooth and coherent semantic transitions and enabling applications such as blending and continuous editing. Our results show that text embedding spaces in text-to-image models implicitly encode a continuous semantic structure that becomes accessible once representations are properly aligned, suggesting that semantic control can be achieved by organizing existing representations rather than modifying the generative model.
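The abstract describes an embedding-level stage that matches token representations across prompts by semantic similarity, but it does not say which matching procedure is used. The sketch below is one possible reading, not the authors' method: it pairs tokens greedily by cosine similarity. The names `cosine` and `greedy_match` and the toy vectors are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def greedy_match(tokens_a: List[Vector], tokens_b: List[Vector]) -> List[int]:
    """Return perm such that tokens_b[perm[i]] is paired with tokens_a[i].

    Assumption: a greedy one-to-one pairing by descending cosine similarity
    stands in for the matching step, which the abstract does not specify.
    """
    if len(tokens_a) != len(tokens_b):
        raise ValueError("this sketch expects sequences of equal length")
    # Score every (i, j) pair, then visit pairs from most to least similar.
    scored = [
        (cosine(a, b), i, j)
        for i, a in enumerate(tokens_a)
        for j, b in enumerate(tokens_b)
    ]
    scored.sort(key=lambda item: -item[0])
    perm = [-1] * len(tokens_a)
    used_b = set()
    for _, i, j in scored:
        if perm[i] == -1 and j not in used_b:
            perm[i] = j
            used_b.add(j)
    return perm


# Toy example: the two concepts appear in swapped order in the second prompt.
tokens_a = [[1.0, 0.0], [0.0, 1.0]]
tokens_b = [[0.0, 2.0], [3.0, 0.0]]
perm = greedy_match(tokens_a, tokens_b)
aligned_b = [tokens_b[j] for j in perm]
print(perm, aligned_b)
```

Greedy pairing is simple but not guaranteed to maximize total similarity. An optimal assignment solver would be a natural alternative if the pairing quality matters.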
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Cite as: arXiv:2606.24021 [cs.CV] (or arXiv:2606.24021v1 [cs.CV] for this version)
DOI: https://doi.org/10.48550/arXiv.2606.24021 (arXiv-issued DOI via DataCite, pending registration)
Submission history
From: Saar Huberman
[v1] Mon, 22 Jun 2026 23:54:40 UTC (34,006 KB)
— Originally published at arxiv.org