FAST-GOAL: Fast and Efficient Global-local Object Alignment Learning
Quick Take
FAST-GOAL fine-tunes CLIP to handle lengthy text descriptions through global-local semantic alignment, and the authors report significant improvements over baselines on long-caption datasets such as DOCCI and DCI. The method combines Fast Local Image-Sentence Matching (FLISM) with Token Similarity-based Learning (TSL), adapting CLIP to detailed textual descriptions while maintaining computational efficiency.
Key Points
- Introduces FAST-GOAL for fine-tuning CLIP on lengthy text descriptions.
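- Uses Fast Local Image-Sentence Matching (FLISM) to extract local image regions through object detection and spatial division and match them with corresponding sentences.
- Uses Token Similarity-based Learning (TSL) to maximize the similarity between region-specific patch tokens and their region embeddings, with the same principle applied to text.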
- Demonstrates significant performance improvements on DOCCI and DCI datasets.
- Maintains computational efficiency while adapting to complex textual inputs.
Article Content
From source RSS / original summary
arXiv:2605.26615v1 Announce Type: new
Abstract: Vision-language models such as CLIP have shown impressive capabilities in aligning images and text, but they often struggle with lengthy and detailed text descriptions because they are pre-trained on short, concise captions. We present FAST-GOAL (Fast and Efficient Global-local Object Alignment Learning), an efficient fine-tuning method that enhances the ability of CLIP to handle lengthy text through global-local semantic alignment. Our method consists of two key components.
First, Fast Local Image-Sentence Matching (FLISM) efficiently extracts local image regions through object detection and spatial division, then matches them with corresponding sentences. Second, Token Similarity-based Learning (TSL) maximizes the similarity between patch tokens from specific regions in the image and their corresponding region embeddings, applying the same principle to text, which enhances the ability of the model to capture detailed correspondences.
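The TSL idea on the image side can be illustrated with a minimal sketch. The abstract only says that TSL maximizes the similarity between patch tokens from a region and the corresponding region embedding; it does not specify how patches are selected, how they are pooled, or the exact loss. The sketch below therefore assumes that a patch belongs to a region when its center lies inside the region's bounding box, that the selected tokens are mean-pooled, and that the loss is one minus cosine similarity. All function names are hypothetical and are not taken from the paper.

```python
import math


def patches_in_box(box, image_size, patch_size):
    """Return row-major indices of patches whose centers lie inside box.

    box is (x0, y0, x1, y1) in pixels. Selecting by patch center is an
    assumption of this sketch, not a detail from the abstract.
    """
    x0, y0, x1, y1 = box
    per_side = image_size // patch_size
    indices = []
    for row in range(per_side):
        for col in range(per_side):
            cx = (col + 0.5) * patch_size
            cy = (row + 0.5) * patch_size
            if x0 <= cx < x1 and y0 <= cy < y1:
                indices.append(row * per_side + col)
    return indices


def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(v[d] for v in vectors) / count for d in range(len(vectors[0]))]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def token_similarity_loss(patch_tokens, box, region_embedding,
                          image_size, patch_size):
    """Assumed loss: 1 - cos(mean of region patch tokens, region embedding).

    Minimizing this value maximizes the similarity between the pooled
    patch tokens of a region and that region's embedding.
    """
    indices = patches_in_box(box, image_size, patch_size)
    if not indices:
        raise ValueError("box does not cover any patch center")
    pooled = mean_pool([patch_tokens[i] for i in indices])
    return 1.0 - cosine(pooled, region_embedding)


if __name__ == "__main__":
    # Toy 4x4-pixel image split into four 2x2 patches with 2-d tokens.
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
    left_half = (0, 0, 2, 4)
    print(token_similarity_loss(tokens, left_half, [2.0, 0.0], 4, 2))
```

In a real fine-tuning setup the tokens would come from CLIP's vision transformer and the region embedding from encoding the cropped region, with the loss computed on batched tensors. The pure-Python version is only meant to make the data flow explicit. The text side would follow the same pattern, with sentence tokens in place of patch tokens.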
Additionally, we introduce GLIT100k, a dataset that provides both global image-lengthy caption pairs and context-derived local pairs, where local descriptions are extracted from global captions to maintain semantic coherence.
Through extensive experiments on long caption datasets (DOCCI, DCI) and short caption datasets (MSCOCO, Flickr30k), we demonstrate that FAST-GOAL achieves significant improvements over baselines, enabling effective adaptation of CLIP to detailed textual descriptions while maintaining computational efficiency.