Disagreement-Based Cross-Model Routing for Implicit Video Question Answering

arXiv cs.CV·Durga Sandeep Saluru

6/16/2026

Quick Answer

The study introduces disagreement-based cross-model routing for multiple-choice video question answering. Using the Gemini 3.1 Pro Preview and Claude Opus 4.8 models, it improves accuracy by 1.43% on the ImplicitQA benchmark.

Quick Take

The method identifies challenging questions and routes them across models, with notable gains in categories that rely on cross-shot references.

Key Points

Disagreement-based routing improves video QA accuracy on the ImplicitQA benchmark by 1.43%.
The method uses the Gemini 3.1 Pro Preview and Claude Opus 4.8 models.
Significant gains appear in Motion & Trajectory (+5.49%) and Inferred Counting (+3.45%).
The method requires no labels or training and operates purely at inference time.
It achieves 82.03 AvgAcc on the CVPR 2026 ImplicitQA challenge test set.
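
The routing idea in the points above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the summary does not say how disagreement is measured, so the sketch assumes that disagreement among repeated samples of a primary model marks a question as hard, and that hard questions are handed to a second model. The `Model` type, `route_by_disagreement`, and `n_samples` are hypothetical names introduced here.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple

# Hypothetical interface (an assumption, not from the paper): a model takes a
# question and its answer options and returns the index of the chosen option.
Model = Callable[[str, Sequence[str]], int]


def route_by_disagreement(
    question: str,
    options: Sequence[str],
    primary: Model,
    secondary: Model,
    n_samples: int = 3,
) -> Tuple[int, bool]:
    """Return (answer_index, was_routed).

    Assumed rule: sample the primary model n_samples times. If every sample
    agrees, keep that answer. If the samples disagree, treat the question as
    hard and defer to the secondary model. No labels or training are used.
    """
    votes = [primary(question, options) for _ in range(n_samples)]
    if len(Counter(votes)) == 1:
        # Unanimous samples: keep the primary model's answer.
        return votes[0], False
    # Disagreement: route the question to the other model.
    return secondary(question, options), True
```

Because the decision depends only on model outputs, a rule of this shape runs purely at inference time, which matches the no-labels, no-training property listed above.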

Source Excerpt

arXiv:2606.14723v1 Announce Type: new Abstract: We study multiple-choice video question answering on the ImplicitQA benchmark, where the correct answer is never explicitly shown but must be inferred from off-screen events, line-of-sight cues, causal structure, and cross-shot spatial layout.

On this benchmark a single frontier video model already operates near its accuracy ceiling, and we observe that conventional self-consistency strategies -- majority voting across repeated samples of the same model -- can hurt rather than help, because the model's errors on hard questions are correlated. …
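
A short calculation shows why majority voting over one model's samples can hurt on hard questions. The sketch below uses a simplified model that is an assumption of this example, not taken from the paper: each sample is either right or wrong, and on a given question each sample is right with the same probability `p`. The function name `majority_vote_accuracy` is hypothetical.

```python
from math import comb


def majority_vote_accuracy(p: float, k: int) -> float:
    """Probability that a strict majority of k samples is correct.

    Assumes each sample is correct with probability p, independently given
    the question, and that k is odd so ties cannot occur.
    """
    if k % 2 == 0:
        raise ValueError("k must be odd")
    need = k // 2 + 1  # smallest number of correct samples that wins the vote
    return sum(comb(k, i) * p**i * (1 - p) ** (k - i) for i in range(need, k + 1))


if __name__ == "__main__":
    for p in (0.9, 0.4):
        # Compare one sample against a vote over five samples.
        print(f"p={p}: single={p:.3f}, vote of 5={majority_vote_accuracy(p, 5):.3f}")
```

Under these assumptions, voting helps when `p` is above 0.5 and hurts when it is below: at `p = 0.4`, a five-sample vote is correct about 31.7% of the time, worse than a single sample. Repeated samples from the same model share the same weakness on a hard question, so the vote reinforces the error instead of averaging it out.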
