Language-Guided Abstraction for Visual Reasoning
Quick Answer
The L-VARC framework improves visual reasoning on ARC by adding a language-guided Learning Using Privileged Information (LUPI) branch during training. The branch is discarded at inference, leaving a lightweight model with only 18 million parameters.
Quick Take
According to the abstract, L-VARC outperforms state-of-the-art methods on the Abstraction and Reasoning Corpus (ARC). It does so with two components: a Semantic Compression Module that refines raw language descriptions, and a Cross-Attention Projector that aligns visual features with semantic embeddings.
Key Points
- L-VARC utilizes a Semantic Compression Module to refine language descriptions.
- The framework incorporates a Cross-Attention Projector for visual-semantic alignment.
- L-VARC achieves state-of-the-art performance on the ARC benchmark.
- The model is lightweight, with only 18 million parameters.
- Ablation studies confirm the effectiveness of the new design components.
Article Content
From source RSS / original summary: arXiv:2606.12847v1. Announce Type: new. Abstract: The Abstraction and Reasoning Corpus (ARC) is viewed as a critical avenue to Artificial General Intelligence (AGI), as it enables models to learn abstract transformation rules from few-shot examples and then generalize to new tasks. However, prevalent ARC methodology is either pure language or vision-only (i.e., VARC). The former depends heavily on LLMs, consuming billions of parameters.
The latter often struggles to capture high-level semantics, leading to overfitting on pixel-level patterns. To bridge this gap, we propose L-VARC, a novel framework that enhances visual reasoning via a language-guided Learning Using Privileged Information (LUPI) branch. Specifically, we design a Semantic Compression Module by feeding a unified, task-agnostic prompt into DeepSeek-V3.
In this way, the raw LARC (a crowd-sourced language description dataset) can be substantially refined and structured, fitting within the context length constraint of standard text encoders (e.g., CLIP). Moreover, we design a Cross-Attention Projector to align visual features with semantic embeddings, aiming to guide the training of the ARC model. Notably, the LUPI branch is used only during training and is discarded during inference, thereby yielding a lightweight model with a mere 18 million parameters.
Extensive experiments demonstrate that our L-VARC effectively leverages linguistic priors to boost visual reasoning and outperforms state-of-the-art methods. Ablation studies further confirm the contribution of the two new designs to the L-VARC framework. The code is available at https://github.com/GZHU-DVL/L-VARC.
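The training-only language branch described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the toy dimensions, the random weights, the cosine-based alignment loss, the weighting factor `lam`, and all function names are assumptions made for illustration. It only shows the general idea of visual tokens attending over text embeddings to produce an auxiliary loss that is added during training and never computed at inference.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(visual, text, w_q, w_k, w_v):
    """Visual tokens act as queries; text embeddings supply keys and values."""
    q, k, v = matmul(visual, w_q), matmul(text, w_k), matmul(text, w_v)
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(qr, kr)) / scale for kr in k] for qr in q]
    weights = [softmax(r) for r in scores]
    return matmul(weights, v), weights

def mean_pool(rows):
    """Average a list of row vectors into a single vector."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def cosine_alignment_loss(a, b):
    """1 - cosine similarity (an assumed choice of alignment objective)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb + 1e-12)

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_vis, d_txt, d_attn = 8, 6, 4           # hypothetical toy dimensions
visual = random_matrix(5, d_vis, rng)    # 5 visual tokens
text = random_matrix(3, d_txt, rng)      # 3 text-token embeddings
w_q = random_matrix(d_vis, d_attn, rng)
w_k = random_matrix(d_txt, d_attn, rng)
w_v = random_matrix(d_txt, d_txt, rng)   # values stay in the text embedding space

def auxiliary_loss(visual, text):
    """Training-only term: compare text-conditioned visual features to the pooled text."""
    projected, _ = cross_attention(visual, text, w_q, w_k, w_v)
    return cosine_alignment_loss(mean_pool(projected), mean_pool(text))

def training_loss(task_loss, visual, text, lam=0.1):
    """Total training objective; at inference only the visual path would run."""
    return task_loss + lam * auxiliary_loss(visual, text)
```

Because the text embeddings enter only through `auxiliary_loss`, dropping that term at inference removes the text encoder and projector from the deployed model, which is consistent with the abstract's claim of a lightweight inference-time network.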