TIGER: Traceable Inference with Graph-Based Evidence Routing for Mitigating Hallucinations in Multimodal Generation
Quick Take
TIGER is a novel inference-time framework designed to mitigate hallucinations in multimodal generation by utilizing graph-based evidence routing. It independently assesses claims against input data, reducing unsupported content while maintaining task quality across four cross-modal paths, including image-to-text and audio-to-text. The method shows a geometric decrease in expected total risk and is effective across multiple model backbones.
Key Points
- TIGER redesigns feedback for localized repair in multimodal generation.
- It uses observation and claim graphs to assign risk scores to claims.
- The framework reduces unsupported claims while preserving output quality.
- Experiments show effectiveness across image-to-text and audio-to-text paths.
- Convergence analysis indicates a geometric decrease in expected total risk.
Article Content
From source RSS / original summaryarXiv:2606. 00232v1 Announce Type: new Abstract: We study fact-level repair for multimodal generation, where a fluent output may contain specific facts that are not supported by the input. Existing inference-time repair methods often generate feedback by jointly conditioning on the input and the current output. This design has two limitations: hallucinated claims in the output can bias the model's interpretation of the input, and free-form feedback cannot be ranked or scheduled at the fact level.
We present TIGER, an inference-time framework that redesigns feedback for localized repair. TIGER independently extracts an observation graph from the input and a claim graph from the current output, then assigns each claim a graph-conditioned risk score based on support and conflict. The model repairs selected high-risk claims while keeping the backbone frozen. We provide a convergence analysis showing that the expected total risk decreases geometrically to an explicit asymptotic bound under mild assumptions.
Experiments across four cross-modal paths, including image-to-text, image+text-to-text, audio-to-text, and video-to-text, show that TIGER reduces unsupported content while preserving task quality. The gains hold across multiple backbones, and a CrisisFACTS case study suggests that the same repair mechanism can improve grounding in multi-source settings.
Reader Mode unavailable (could not extract clean content).
Want this in your inbox every morning?
Daily brief at your local 8am — bilingual EN/中文, free.
More from arXiv cs.AI
See more →MindGames Arena Generalization Track: In2AI Solution with Delayed Per-Step Reward Attribution
The In2AI solution introduces delayed per-step reward attribution for training language model agents in multi-agent environments, achieving top performance on the MindGames Arena benchmark at NeurIPS 2025. An 8-billion-parameter model outperformed larger proprietary systems, including GPT-5, in competitive play, demonstrating enhanced stability and sample efficiency in reinforcement learning.