Personal AI Agent for Camera Roll VQA

arXiv cs.CV·Thao Nguyen, Krishna Kumar Singh, Donghyun Kim, Yong Jae Lee, Yuheng Li

6/5/2026

Quick Answer

This paper introduces camroll-agent, a conversational AI assistant that navigates a user's personal camera roll to answer visual questions. It outperforms numerous baselines on camroll, a new dataset of 50 users, 31,476 images, and 2,500 QA pairs.

Quick Take

Answering questions over years of personal photos is a long-context problem, and camroll-agent handles it better than existing long-context baselines. The result suggests that personalized visual memory needs approaches distinct from those built for long-context textual memory.

Key Points

The camroll dataset covers 50 users and 31,476 images.
2,500 manually annotated QA pairs mimic real-world usage.
camroll-agent pairs hierarchical memory with a minimal set of tools for efficient navigation.
In experiments, camroll-agent outperforms numerous baselines and long-context methods.
Personalized visual memory requires different approaches from standard long-context textual memory.

Article Content

From source RSS / original summary

arXiv:2606.05275v1. Abstract: We study the personal camera roll visual question answering setting. In this setting, a conversational AI assistant can access a user's personal camera roll and retrieve relevant photos to answer queries, ranging from simple factual questions (e.g., "Name of the food I tried yesterday?") to more open-ended ones (e.g., "Recommend some dishes I have never eaten before"). Given the vast nature of the personal camera roll (i.e., multiple years, hundreds to thousands of photos), a successful AI assistant needs to understand a long-horizon, highly personalized visual content stream in order to navigate and locate the correct and/or relevant information.

To support this, we collect and manually annotate questions that mimic real-world usage. The final dataset, camroll, contains 50 users, 31,476 images, and 2,500 QA pairs.

We further design camroll-agent, a conversational AI agent equipped with hierarchical memory and a minimal set of tools for efficient navigation over large, personalized visual memory. Experimental results show that camroll-agent outperforms numerous baselines and long-context understanding methods for AI agent systems.

Together, the camroll dataset and camroll-agent highlight the gap in AI agents' long-context reasoning: personalized visual memory requires different approaches from standard long-context textual memory, especially when consistency, visual details, and user-specific context are present.
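The abstract does not say how the hierarchical memory or the tools are built. Purely as an illustration of the idea, the sketch below assumes a two-level hierarchy (months on top, photos underneath) and three hypothetical tools, `list_months`, `open_month`, and `search`, operating on text captions. None of these names, levels, or design choices come from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Photo:
    photo_id: str
    taken_on: date
    caption: str  # assumption: each image has a short text description


class HierarchicalMemory:
    """Toy two-level memory: month buckets on top, individual photos below."""

    def __init__(self, photos):
        self._by_month = defaultdict(list)
        for photo in sorted(photos, key=lambda p: p.taken_on):
            key = (photo.taken_on.year, photo.taken_on.month)
            self._by_month[key].append(photo)

    def list_months(self):
        """Hypothetical tool 1: coarse overview as (year, month, photo_count)."""
        return [(y, m, len(ps)) for (y, m), ps in sorted(self._by_month.items())]

    def open_month(self, year, month):
        """Hypothetical tool 2: drill down into the photos of one month."""
        return list(self._by_month.get((year, month), []))

    def search(self, keyword, year=None, month=None):
        """Hypothetical tool 3: caption keyword match, optionally within one month."""
        if year is not None and month is not None:
            candidates = self.open_month(year, month)
        else:
            candidates = [p for _, ps in sorted(self._by_month.items()) for p in ps]
        needle = keyword.lower()
        return [p for p in candidates if needle in p.caption.lower()]


photos = [
    Photo("a1", date(2024, 5, 3), "bowl of pho at a street stall"),
    Photo("a2", date(2024, 5, 9), "birthday cake with candles"),
    Photo("b1", date(2025, 1, 14), "pho with extra herbs at home"),
]
memory = HierarchicalMemory(photos)
print(memory.list_months())
print([p.photo_id for p in memory.search("pho")])
print([p.photo_id for p in memory.search("pho", year=2025, month=1)])
```

The point of the layering is that an agent can read the short month overview first and open only the months that look relevant, rather than loading thousands of photos into context at once.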

