StrLoRA: Towards Streaming Continual Visual Instruction Tuning for MLLMs

arXiv cs.CV·Chang Che, Ziqi Wang, Hui Ma, Cheems Wang, Zenglin Shi



Quick Take

StrLoRA is a regularized two-stage expert routing framework for Streaming Continual Visual Instruction Tuning (StrCVIT), a setting in which MLLMs learn from a data stream of interleaved, dynamically evolving tasks.

Key Points

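StrCVIT is a setting in which models learn from a stream of data chunks, each containing a dynamic mixture of tasks.
StrLoRA uses the textual instruction to activate a sparse subset of experts, reducing cross-task interference, then weights those experts token by token.
Experiments on a newly developed StrCVIT benchmark show that StrLoRA substantially outperforms existing CVIT methods.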


[Submitted on 8 May 2026]


Abstract: Continual Visual Instruction Tuning (CVIT) enables Multimodal Large Language Models to incrementally acquire new abilities. However, existing CVIT methods operate under a restrictive task-incremental setting, where each training phase corresponds to a single, predefined task. This does not reflect real-world conditions, where data arrives as a continuous stream of interleaved and dynamically evolving tasks. To bridge this gap, we introduce Streaming CVIT (StrCVIT), a more general and realistic setting where models learn from a stream of data chunks containing a dynamic mixture of tasks. In StrCVIT, a model must simultaneously acquire new abilities, reinforce recurring abilities, and mitigate forgetting. Existing CVIT methods fail here as they cannot reliably distinguish or adapt to the heterogeneous task samples within each chunk. We therefore propose StrLoRA, a regularized two-stage expert routing framework. StrLoRA first performs task-aware expert selection using the textual instruction to activate a sparse subset of relevant experts, reducing cross-task interference. It then applies token-wise expert weighting within this subset, where contribution weights are computed via cross-modal attention between local visual tokens and the global instruction representation. To maintain stability across the non-stationary stream, a routing-stability regularization aligns current routing distributions with a historical exponential moving average reference. Extensive experiments on a newly developed StrCVIT benchmark show that StrLoRA substantially outperforms existing methods, effectively enhancing the model's abilities from continuously evolving data streams.
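
The two routing stages and the stability regularizer described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract gives no formulas, so the expert key vectors (`expert_keys`), the dot-product scoring, the element-wise instruction conditioning in stage two, the KL-divergence form of the regularizer, and the momentum value are all assumptions chosen only to make the idea concrete.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def select_experts(instr_emb, expert_keys, k):
    """Stage 1 (assumed form): score each expert against the instruction
    embedding and keep the k highest-scoring experts."""
    scores = expert_keys @ instr_emb          # shape (E,)
    probs = softmax(scores)                   # routing distribution over all experts
    top = np.argsort(-scores)[:k]             # indices of the selected experts
    return top, probs

def token_weights(visual_tokens, instr_emb, expert_keys, top):
    """Stage 2 (assumed form): condition every visual token on the global
    instruction vector, then softmax over the selected experts only."""
    d = instr_emb.shape[0]
    queries = visual_tokens * instr_emb       # (T, d), hypothetical cross-modal interaction
    logits = queries @ expert_keys[top].T / np.sqrt(d)   # (T, k)
    return softmax(logits, axis=-1)           # each row sums to 1

def routing_stability_loss(probs, ema_ref, eps=1e-8):
    """Assumed regularizer: KL(current routing || historical EMA reference)."""
    return float(np.sum(probs * (np.log(probs + eps) - np.log(ema_ref + eps))))

def update_ema(ema_ref, probs, momentum=0.9):
    # Exponential moving average of past routing distributions.
    return momentum * ema_ref + (1.0 - momentum) * probs

# Toy usage with random embeddings (all sizes are illustrative).
rng = np.random.default_rng(0)
E, d, T, k = 4, 8, 5, 2                       # experts, embedding dim, visual tokens, top-k
expert_keys = rng.normal(size=(E, d))
instr_emb = rng.normal(size=d)
visual_tokens = rng.normal(size=(T, d))

top, probs = select_experts(instr_emb, expert_keys, k)
weights = token_weights(visual_tokens, instr_emb, expert_keys, top)

ema_ref = np.full(E, 1.0 / E)                 # start from a uniform reference
loss = routing_stability_loss(probs, ema_ref)
ema_ref = update_ema(ema_ref, probs)
```

In a full model, each row of `weights` would presumably mix the outputs of the selected LoRA experts for one visual token, and `loss` would be added to the training objective with some coefficient; both details are outside what the abstract states.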

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2605.16353 [cs.CV]
	(or arXiv:2605.16353v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2605.16353 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Chang Che
[v1] Fri, 8 May 2026 06:16:37 UTC (2,290 KB)

Originally published at arxiv.org

