Disagreement-Based Cross-Model Routing for Implicit Video Question Answering
Quick Answer
The study introduces disagreement-based cross-model routing for video question answering, enhancing accuracy by 1.43% on the ImplicitQA benchmark using Gemini 3.1 Pro Preview and Claude Opus 4.8 models.
Quick Take
The study introduces disagreement-based cross-model routing for video question answering, enhancing accuracy by 1.43% on the ImplicitQA benchmark using Gemini 3.1 Pro Preview and Claude Opus 4.8 models. This method effectively identifies and routes challenging questions, achieving significant gains in categories reliant on cross-shot references.
Key Points
- Disagreement-based routing improves video QA accuracy on ImplicitQA benchmark by +1.43%.
- Utilizes Gemini 3.1 Pro Preview and Claude Opus 4.8 models for enhanced performance.
- Significant gains observed in Motion & Trajectory (+5.49%) and Inferred Counting (+3.45%).
- Method requires no labels or training, operating purely at inference time.
- Achieved 82.03 AvgAcc on the CVPR 2026 ImplicitQA challenge test set.
Paper Resources
Article Content
From source RSS / original summaryarXiv:2606. 14723v1 Announce Type: new Abstract: We study multiple-choice video question answering on the ImplicitQA benchmark, where the correct answer is never explicitly shown but must be inferred from off-screen events, line-of-sight cues, causal structure, and cross-shot spatial layout.
On this benchmark a single frontier video LLM already operates near its accuracy ceiling, and we observe that conventional self-consistency strategies -- majority voting across repeated samples of the same model -- can hurt rather than help, because the model's errors on hard questions are correlated. We propose disagreement-based cross-model routing, a pure inference-time procedure that requires no labels and no training. We triple-sample a native-video model (Gemini 3.
1 Pro Preview) at temperature zero, exploit the genuine sample-to-sample variance of its video-processing pipeline to identify the roughly 20% subset of questions where the three samples disagree, and route only that subset to a second model from a different family (Claude Opus 4. 8) that consumes uniformly sampled frames with adaptive thinking. On the 1001-question validation set with public ground truth -- our main evaluation -- the method improves AvgAcc by +1.
43 over the best single sample of the primary model, with per-category gains concentrated on Motion & Trajectory (+5. 49), Inferred Counting (+3. 45), and Vertical Spatial Reasoning (+1. 82) -- the categories most dependent on cross-shot reference resolution. The same pipeline applied to the held-out 172-question CVPR 2026 ImplicitQA challenge test set achieves 82. 03 AvgAcc / 79. 71 MacroAvgAcc (+1. 81 over the best single sample of the primary model), confirming the validation result on an independent split.
Reader Mode unavailable (could not extract clean content).
Want this in your inbox every morning?
Daily brief at your local 8am — bilingual EN/中文, free.
More from arXiv cs.CV
See more →LLM-Guided ANN Index Optimization for Human-Object Interaction Retrieval
A phase-aware LLM agent optimizes human-object interaction retrieval, outperforming Optuna TPE by 33.3% and VDTuner by 34.2% on the HICO-DET benchmark. This method enhances throughput by 15.3x over UniIR and demonstrates strong transferability across vector database management systems.