HiLo-Token: Input-Adaptive High-Low Frequency Token Compression for Efficient Image Editing

arXiv cs.CV·Haoran You, Yotam Nitzan, Lingzhi Zhang, Yifan Gong, Mang-Tik Chiu, Connelly Barnes, Yan Kang, Yuqian Zhou, Eli Shechtman, Sohrab Amirghodsi


·~2 min·6/15/2026·en·1

Quick Take

HiLo-Token is an input-adaptive token compression framework for image editing. It gives more of the token budget to high-frequency, context-rich regions and fewer tokens to low-frequency areas, achieving up to 3.13x speedups of the Diffusion Transformer (DiT) module on an A100-80GB without regression in generation quality. The target is the DiT module itself, which accounts for an average of 73% of total model latency in the authors' evaluation, in editing features such as Photoshop's Remove and Generative Fill.

Key Points

HiLo-Token allocates more tokens to high-frequency, context-rich regions and fewer to low-frequency areas.
Achieves 3.13x, 2.59x, and 1.67x DiT speedups on A100-80GB for small, medium, and large mask ratios, respectively.
The DiT module accounts for an average of 73% of total model latency, even after distillation from 50 to 8 timesteps.
Retains all tokens inside a dilated user mask; outside it, selects high-frequency tokens and represents low-frequency content with tokens from a 16x downsampled image.
Reports no regression in generation quality on production-level evaluation data.


Article Content

From source RSS / original summary

arXiv:2606.13898v1 Announce Type: new Abstract: Creative image editing tools, such as Photoshop's Remove or Generative Fill buttons, are central to everyday customer use and account for a major share of traffic in Photoshop and Lightroom. However, current generative AI models face significant latency challenges, which become even more pronounced when transitioning from convolution-based U-Nets to Diffusion Transformers (DiTs).

In our evaluation on hundreds of representative image editing samples spanning a wide range of mask ratios, the DiT module alone accounts for an average of 73% of the total model latency, even after being distilled from 50 timesteps down to 8 timesteps. To tackle this challenge, we propose HiLo-Token, an input-adaptive token compression framework that allocates more token budget to high-frequency, rich-context regions while assigning fewer tokens to low-frequency areas.

Specifically, for the editing region specified by the user mask, we retain all tokens within a dilated mask to preserve strong locality and contextual relevance. Outside the editing region, we introduce a simple yet effective high-frequency token selection strategy based on spatial frequency to capture important local details, while using tokens from a 16x downsampled image to represent low-frequency components and preserve the blurry but global structure.
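The three-way split described above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the authors' implementation: the per-token `freq_score` grid, the square dilation window, the `radius` and `keep_ratio` parameters, and the reading of "16x downsampled" as 16x per side are all assumptions introduced here, and the abstract does not say how the spatial-frequency measure is computed or how the token groups are combined inside the DiT.

```python
def dilate(mask, radius):
    """Binary dilation of a 2D 0/1 grid using a square window (assumed shape)."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                    for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[yy][xx] = 1
    return out


def select_tokens(freq_score, mask, radius=1, keep_ratio=0.25):
    """Return (inside, high_freq) as sets of (row, col) token positions.

    freq_score: 2D grid with one hypothetical high-frequency score per token.
    mask:       2D 0/1 grid, 1 where the user wants to edit.
    radius, keep_ratio: illustrative values, not taken from the paper.
    """
    h, w = len(mask), len(mask[0])
    dilated = dilate(mask, radius)
    # Every token inside the dilated mask is kept.
    inside = {(y, x) for y in range(h) for x in range(w) if dilated[y][x]}
    # Outside the dilated mask, keep only the highest-scoring fraction.
    outside = [(y, x) for y in range(h) for x in range(w) if not dilated[y][x]]
    outside.sort(key=lambda p: freq_score[p[0]][p[1]], reverse=True)
    k = int(len(outside) * keep_ratio)
    return inside, set(outside[:k])


def low_freq_token_count(h, w, factor=16):
    """Token count of a grid downsampled by `factor` per side (assumed reading)."""
    return ((h + factor - 1) // factor) * ((w + factor - 1) // factor)
```

Under these assumptions, the compressed sequence length is `len(inside) + len(high_freq) + low_freq_token_count(h, w)`, which shrinks as the mask gets smaller. That is consistent with the larger speedups the abstract reports for small mask ratios.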

Extensive experiments on production-level evaluation data validate the effectiveness of the proposed method, achieving 3.13x, 2.59x, and 1.67x DiT speedups on A100-80GB for image editing tasks across small, medium, and large mask ratio categories with average ratios of 6.38%, 15.92%, and 35.36%, respectively, without any regression in generation quality.
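The reported figures are speedups of the DiT module, not of the whole pipeline. As a rough back-of-the-envelope check, Amdahl's law gives the end-to-end speedup implied when only the DiT share of latency is accelerated. The sketch below assumes the 73% average share applies to every mask category, which the abstract does not state, so the results are illustrative rather than reported numbers.

```python
def overall_speedup(fraction, module_speedup):
    """Amdahl's law: end-to-end speedup when only `fraction` of runtime is accelerated."""
    return 1.0 / ((1.0 - fraction) + fraction / module_speedup)


# Average DiT share of total latency, from the abstract (assumed uniform here).
DIT_SHARE = 0.73

for label, dit_speedup in [("small", 3.13), ("medium", 2.59), ("large", 1.67)]:
    print(f"{label} masks: {overall_speedup(DIT_SHARE, dit_speedup):.2f}x end to end")
```

Under that assumption, the implied end-to-end speedups are roughly 1.99x, 1.81x, and 1.41x for small, medium, and large masks, because the remaining 27% of latency is untouched.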
