Personalizing Embodied Multimodal Large Language Model Agents over Long-term User Interactions
Quick Take
POLAR is a multimodal memory-augmented framework for personalized embodied agents. It stores information from past user interactions and retrieves it to interpret new requests and guide task execution. Evaluations across several MLLM backbones show consistent performance gains, especially when agents must reason across multiple interactions, perform multi-hop inference, or track updates to user-specific context over time.
Key Points
- POLAR organizes prior interactions into a multimodal knowledge graph.
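- Semantic memory holds personalized context and visual concepts; episodic memory holds embodied experiences such as agent trajectories.
- Gains are most pronounced in reasoning across multiple interactions, multi-hop inference, and tracking updates to user-specific context.
- The memory mechanism lets agents make more effective use of information from prior interactions.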
- Evaluated across multiple MLLM backbones and diverse scenarios.
Article Content
arXiv:2605.26256v1 (Announce Type: new)
Abstract: Multimodal large language model (MLLM)-based embodied agents have shown strong potential for solving complex tasks in physical environments. However, personalized assistance requires more than following generic instructions or recognizing object categories. In real-world scenarios, the intended target is often specified only implicitly through prior interactions, requiring agents to leverage personalized context accumulated over time.
In this work, we propose POLAR, a multimodal memory-augmented framework for personalized embodied agents over long-term user interactions. POLAR organizes prior interactions into a multimodal knowledge graph that captures semantic memory for personalized context and visual concepts, and episodic memory for embodied experiences such as agent trajectories. To execute embodied tasks, POLAR retrieves relevant memories to interpret the current request and guide task execution.
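The abstract describes the memory only at a high level. As a minimal sketch, assuming semantic memory is stored as (subject, relation, object) edges and episodic memory as past instructions paired with the action trajectory the agent executed, a toy store with a retrieval step might look like the following. The class names, fields, the `image_ref` link, and the word-overlap scoring are all hypothetical illustrations, not POLAR's actual implementation; a real system would presumably rank memories with multimodal embeddings over text and images.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SemanticFact:
    # Personalized context or a visual concept stored as a graph edge,
    # e.g. ("user", "favorite_mug", "blue_mug").
    subject: str
    relation: str
    obj: str
    image_ref: Optional[str] = None  # hypothetical link to a reference image


@dataclass
class Episode:
    # An embodied experience: the request plus the action trajectory executed.
    instruction: str
    trajectory: List[str]


def _tokens(text: str) -> set:
    # Lowercase and split snake_case identifiers into separate words.
    return set(text.lower().replace("_", " ").split())


@dataclass
class MemoryGraph:
    facts: List[SemanticFact] = field(default_factory=list)
    episodes: List[Episode] = field(default_factory=list)

    def add_fact(self, subject: str, relation: str, obj: str,
                 image_ref: Optional[str] = None) -> None:
        self.facts.append(SemanticFact(subject, relation, obj, image_ref))

    def add_episode(self, instruction: str, trajectory: List[str]) -> None:
        self.episodes.append(Episode(instruction, list(trajectory)))

    def retrieve(self, request: str, k: int = 2
                 ) -> Tuple[List[SemanticFact], List[Episode]]:
        """Return up to k facts and k episodes sharing words with the request.

        Word overlap is a stand-in for a multimodal similarity search.
        """
        query = _tokens(request)

        def top(items, text_of):
            scored = [(len(query & _tokens(text_of(x))), x) for x in items]
            scored = [s for s in scored if s[0] > 0]  # drop unrelated memories
            scored.sort(key=lambda s: s[0], reverse=True)  # stable for ties
            return [x for _, x in scored[:k]]

        facts = top(self.facts, lambda f: f"{f.subject} {f.relation} {f.obj}")
        episodes = top(self.episodes, lambda e: e.instruction)
        return facts, episodes


memory = MemoryGraph()
memory.add_fact("user", "favorite_mug", "blue_mug")
memory.add_fact("blue_mug", "usually_on", "kitchen_shelf")
memory.add_episode("bring my favorite mug",
                   ["go_to kitchen", "pick blue_mug", "go_to sofa", "place blue_mug"])

facts, episodes = memory.retrieve("fetch my favorite mug")
print([f.obj for f in facts])       # ['blue_mug', 'kitchen_shelf']
print(episodes[0].trajectory[1])    # pick blue_mug
```

Here the request never names the object, yet the retrieved semantic fact resolves "my favorite mug" to `blue_mug` and the retrieved episode offers a prior trajectory the agent could reuse as guidance.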
We evaluate POLAR across multiple MLLM backbones and diverse evaluation scenarios to study the role of memory in long-term personalization. Results show that the proposed memory mechanism consistently improves performance by enabling more effective use of information accumulated over prior interactions. The gains are especially pronounced when the agents are required to reason across multiple interactions, perform multi-hop inference, or track updates in user-specific context over time.
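To make "multi-hop inference" and "tracking updates" concrete, here is a hypothetical sketch assuming each fact carries the index of the interaction in which the user stated it. A newer fact with the same subject and relation supersedes older ones, and resolving a request may require chaining several lookups. The function names, fact format, and example data are invented for illustration; the abstract does not say how POLAR performs multi-hop reasoning, which may well be left to the MLLM operating over retrieved memories.

```python
from typing import List, Optional, Tuple

# Hypothetical fact format: (interaction_index, subject, relation, object).
Fact = Tuple[int, str, str, str]


def latest(facts: List[Fact], subject: str, relation: str) -> Optional[str]:
    """Most recent object for (subject, relation); newer interactions win."""
    matches = [f for f in facts if f[1] == subject and f[2] == relation]
    if not matches:
        return None
    return max(matches, key=lambda f: f[0])[3]


def resolve_chain(facts: List[Fact], start: str,
                  relations: List[str]) -> Optional[str]:
    """Follow one relation per hop, using the latest value at each hop."""
    node: Optional[str] = start
    for relation in relations:
        if node is None:
            return None
        node = latest(facts, node, relation)
    return node


history: List[Fact] = [
    (1, "user", "gift_from_sister", "green_mug"),
    (2, "green_mug", "stored_in", "kitchen_cabinet"),
    (5, "green_mug", "stored_in", "office_desk"),  # a later update
]

# "Bring me the mug my sister gave me" needs two hops: which mug, then where it is now.
print(resolve_chain(history, "user", ["gift_from_sister", "stored_in"]))  # office_desk
```

Keeping every timestamped fact instead of overwriting in place preserves history, which could also answer questions about earlier states; that design choice is likewise an assumption of this sketch.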
