Machine Intelligence that Understands Visual and Linguistic Information and Interacts with Humans and Environments

arXiv cs.CV·Van Quang Nguyen

5/26/2026

Quick Take

The dissertation introduces GRIT, a transformer-based architecture for image captioning, outperforming previous models in accuracy and speed. It also presents LTMI for efficient visual dialog with significantly fewer parameters and a two-stage framework for interactive instruction-following, achieving an 8.37% unseen success rate on the ALFRED dataset.

Key Points

GRIT integrates grid and region features for improved image captioning performance.
LTMI matches the representational power of a standard Transformer extension with less than one-tenth of its parameters.
The interactive instruction-following framework uses a two-stage interpretation process.
The framework localizes objects accurately using multiple egocentric views and hierarchical attention.
It achieves a state-of-the-art unseen success rate of 8.37% on the ALFRED dataset.

Article Content

From source RSS / original summary

arXiv:2605.24020v1. Abstract: Advancements at the intersection of computer vision and natural language processing are crucial for applications like assistive tech, multimedia querying, and robotics. This dissertation proposes novel architectures to improve intelligent agents across three key vision-language tasks: image captioning, visual dialog, and interactive instruction following.

First, we address limitations in visual representation for image captioning. Traditional models rely on region-based features from CNN detectors, which lack global context and suffer from high computational overhead. We propose GRIT (Grid and Region-based Image captioning Transformer), a transformer-only architecture. By integrating grid and region features using a DETR-based detector, GRIT enables end-to-end training and outperforms prior methods in both inference accuracy and speed.

Second, we tackle visual dialog, which requires multi-turn conversation about an image. The challenge lies in efficiently modeling interactions between multiple inputs (image, question, history). We introduce LTMI (Light-weight Transformer for Many Inputs). Using a specialized attention block, an LTMI layer matches the representational power of a standard Transformer extension while using less than one-tenth of its parameters, as validated on the VisDial dataset.

Finally, we study interactive instruction-following for embodied AI using the ALFRED dataset. We propose a framework featuring a two-stage instruction interpretation: it first decodes language directives independently of visual context to predict a tentative action-object sequence, which is then fused with visual features for final execution. Using multiple egocentric views and hierarchical attention, our method accurately localizes objects and achieves a state-of-the-art unseen success rate of 8.37%.
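To make the two-stage interpretation concrete, here is a toy sketch of the control flow only. It is not the dissertation's model: the keyword tables, the `Step` type, the `interpret_language` and `ground_step` functions, the action labels, and the per-view detection scores are all hypothetical stand-ins for learned components. Stage 1 looks at the instruction text alone and produces a tentative action-object sequence; stage 2 combines each tentative step with what is visible in several egocentric views.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical keyword tables standing in for a learned language decoder.
ACTION_WORDS = {"pick": "PickupObject", "grab": "PickupObject",
                "put": "PutObject", "place": "PutObject", "open": "OpenObject"}
OBJECT_WORDS = {"mug": "Mug", "cup": "Mug", "apple": "Apple",
                "fridge": "Fridge", "table": "Table"}


@dataclass(frozen=True)
class Step:
    action: str
    target: str


def interpret_language(instruction: str) -> list[Step]:
    """Stage 1 (language only): pair each action word with the next object word."""
    steps: list[Step] = []
    pending_action = None
    tokens = instruction.lower().replace(",", " ").replace(".", " ").split()
    for token in tokens:
        if token in ACTION_WORDS:
            pending_action = ACTION_WORDS[token]
        elif token in OBJECT_WORDS and pending_action is not None:
            steps.append(Step(pending_action, OBJECT_WORDS[token]))
            pending_action = None
    return steps


def ground_step(step: Step, views: dict[str, dict[str, float]]):
    """Stage 2 (language + vision): choose the view where the target is most visible.

    `views` maps a view name to assumed detector scores per object label.
    If the target is not detected in any view, fall back to an exploration action.
    """
    best_view, best_score = None, 0.0
    for name, detections in views.items():
        score = detections.get(step.target, 0.0)
        if score > best_score:
            best_view, best_score = name, score
    if best_view is None:
        return ("Explore", step.target, None)
    return (step.action, step.target, best_view)


plan = interpret_language("Pick up the mug, then put it in the fridge.")
views = {"front": {"Table": 0.9}, "left": {"Mug": 0.7}, "right": {"Mug": 0.4}}
grounded = [ground_step(step, views) for step in plan]
print(grounded)
```

In this sketch the mug is grounded in the left view, while the fridge is not visible in any view, so the second step falls back to exploration. Keeping stage 1 independent of the image means the tentative plan can be produced once and then revised step by step as new views arrive.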
