ECA: Efficient Continual Alignment for Open-Ended Image-to-Text Generation
Quick Answer
This paper shows that the Efficient Continual Alignment (ECA) method improves Open-ended Image-to-Text Generation (OpenITG) by letting models adapt their alignment to evolving visual data while mitigating catastrophic forgetting.
Quick Take
ECA combines a Mixture of Query (MoQ) module, Fisher Dynamic Expansion (FeDEx), and Dictionary Replay (DR) to retain knowledge without accessing raw data from previous tasks. The authors report that it significantly mitigates catastrophic forgetting and improves incremental learning performance over baseline methods on four newly constructed benchmarks.
Key Points
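- Introduces continual alignment: incrementally adapting the alignment module within pre-trained VLMs as the predominant category of visual data shifts over time.
- Uses three mechanisms: Mixture of Query (MoQ), Fisher Dynamic Expansion (FeDEx), and Dictionary Replay (DR).
- Significantly mitigates catastrophic forgetting compared to baseline methods, without storing raw data from previous tasks.
- Four new IL OpenITG benchmarks are constructed to better reflect real-world scenarios.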
- Code and benchmarks available on GitHub for further research.
Article Content
From source RSS / original summary
arXiv:2606.12633v1 Announce Type: new
Abstract: Incremental Learning (IL) for Open-ended Image-to-Text Generation (OpenITG) enables models to continuously generate accurate, contextually relevant text for new images while preserving previously acquired knowledge. Unlike prior studies, this paper addresses a more practical scenario in which the predominant category of visual data shifts over time as environments evolve.
In this context, we introduce a new notion of continual alignment, which incrementally adapts the alignment module within pre-trained VLMs to preserve high-quality cross-modal representations. Based on this idea, we propose Efficient Continual Alignment (ECA), a novel exemplar-free IL approach for OpenITG. The key challenge is enabling the model to acquire new, task-specific features while minimizing interference with the established alignment without accessing raw data from previous tasks.
To address this, ECA employs three core mechanisms: a Mixture of Query (MoQ) module that adapts task-specific query tokens, a Fisher Dynamic Expansion (FeDEx) that dynamically expands model structure based on a Fisher Information Matrix (FIM)-based metric, and an embedding dictionary with Dictionary Replay (DR) to retain past knowledge. To evaluate ECA's performance, we construct four new IL OpenITG benchmarks that better reflect real-world scenarios.
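As an illustration of the kind of quantity a FIM-based metric can build on, here is a minimal sketch; it is not the paper's FeDEx implementation. It estimates the diagonal of the empirical Fisher information for a toy logistic-regression model and then applies a hypothetical threshold rule. The function names, the toy model, the threshold value, and the direction of the rule (expand when the mean Fisher value is high) are all assumptions made for illustration; the abstract does not specify them.

```python
import math


def sigmoid(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def diagonal_fisher(weights, inputs, labels):
    """Empirical diagonal Fisher information for logistic regression.

    For each sample, the gradient of log p(y | x) with respect to w_i is
    (y - p) * x_i, where p = sigmoid(w . x). The diagonal Fisher estimate
    is the mean of the squared per-sample gradients.
    """
    fisher = [0.0] * len(weights)
    for x, y in zip(inputs, labels):
        p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
        for i, xi in enumerate(x):
            grad = (y - p) * xi
            fisher[i] += grad * grad
    return [f / len(inputs) for f in fisher]


def should_expand(fisher_diag, threshold):
    """Hypothetical expansion rule (an assumption, not taken from the paper):
    request extra capacity when the mean Fisher value exceeds a threshold."""
    return sum(fisher_diag) / len(fisher_diag) > threshold


# Toy data: two samples, two features; only the first feature is active.
weights = [0.0, 0.0]
inputs = [[1.0, 0.0], [1.0, 0.0]]
labels = [1, 0]

fisher_diag = diagonal_fisher(weights, inputs, labels)
expand = should_expand(fisher_diag, threshold=0.1)
```

In this toy setup the first weight receives a nonzero Fisher value and the second receives zero, since the second feature never varies. In a real VLM the same estimate would be computed over the parameters of the alignment module using gradients from an autodiff framework.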
Experimental results demonstrate that ECA significantly mitigates catastrophic forgetting and improves IL performance compared to baseline methods. Code and benchmarks are available at https://github.com/Snowball0823/ECA.