AUDITFLOW: Executable Symbolic Environments for Structured Financial Reporting Verification
Quick Take
AuditFlow is a graph-grounded multi-agent framework for verifying structured financial reports. With GPT-5.5 it reaches 82.09% joint audit accuracy on a FinAuditing-derived FinMR sample, 14.93 points above the strongest baseline. Audits run inside a symbolic environment, and an ablation that removes the deterministic checks shows they are necessary for reliable verification.
Key Points
- AuditFlow separates adaptive search from deterministic verification for improved accuracy.
- The framework uses US-GAAP taxonomy and XBRL filing graphs for structured audits.
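- Two junior auditors inspect each case from regulatory and evidentiary views; a senior auditor resolves their disagreements and can request further investigation.
- The auditors' reports are fused through evidential aggregation into an audit verdict, expected value, evidence trail, and trustworthiness score.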
- Removing deterministic checks cuts joint audit accuracy from 82.09% to 17.91%, showing that the model cannot reliably replace them.
Article Content
From the source RSS / original summary. arXiv:2606.03031v1, Announce Type: new. Abstract: Structured financial audit verification is difficult for language-model agents because correctness depends on structured evidence rather than text alone. A model must link reported facts to taxonomy concepts, traverse calculation or dimensional relations, and recompute expected values before applying an audit rule. We propose AuditFlow, a graph-grounded multi-agent framework that separates adaptive search from deterministic verification.
AuditFlow builds a symbolic environment from a static US-GAAP taxonomy graph and a dynamic XBRL filing graph, and exposes it through typed tools for fact retrieval, taxonomy traversal, numerical checking, and rule evaluation. Two junior auditors inspect each case from regulatory and evidentiary views, while a senior auditor resolves disagreements and can request further investigation.
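The typed tools described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not AuditFlow's implementation: the class `FilingEnvironment`, its method names, the `CheckResult` type, the tolerance, and the fact values are all hypothetical, and the taxonomy graph is reduced to a single calculation relation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical static taxonomy graph: parent concept -> [(child concept, weight)].
# A real US-GAAP calculation linkbase is far larger; one relation is enough here.
CALCULATION_LINKS: Dict[str, List[Tuple[str, float]]] = {
    "Assets": [("AssetsCurrent", 1.0), ("AssetsNoncurrent", 1.0)],
}


@dataclass(frozen=True)
class CheckResult:
    concept: str
    reported: float
    expected: float
    consistent: bool


class FilingEnvironment:
    """Toy stand-in for a dynamic filing graph exposed through typed tools."""

    def __init__(self, facts: Dict[str, float]) -> None:
        self._facts = dict(facts)

    def get_fact(self, concept: str) -> Optional[float]:
        """Fact retrieval: return the reported value, or None if absent."""
        return self._facts.get(concept)

    def children_of(self, concept: str) -> List[Tuple[str, float]]:
        """Taxonomy traversal: weighted calculation children of a concept."""
        return CALCULATION_LINKS.get(concept, [])

    def check_calculation(self, concept: str, tolerance: float = 0.5) -> CheckResult:
        """Numerical check: recompute the parent from its children and compare."""
        reported = self.get_fact(concept)
        if reported is None:
            raise KeyError(f"no reported fact for {concept}")
        children = self.children_of(concept)
        if not children:
            raise ValueError(f"{concept} has no calculation children")
        expected = 0.0
        for child, weight in children:
            value = self.get_fact(child)
            if value is None:
                raise KeyError(f"no reported fact for child {child}")
            expected += weight * value
        consistent = abs(reported - expected) <= tolerance
        return CheckResult(concept, reported, expected, consistent)


# Made-up filing in which the reported total does not match its parts.
env = FilingEnvironment(
    {"Assets": 1500.0, "AssetsCurrent": 600.0, "AssetsNoncurrent": 800.0}
)
result = env.check_calculation("Assets")
print(result.expected, result.consistent)
```

The point of the design is that an agent decides which concept to inspect, while the comparison itself is plain arithmetic that returns the same answer on every call.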
The final reports are fused through evidential aggregation to produce an audit verdict, expected value, evidence trail, and trustworthiness score. On a FinAuditing-derived FinMR sample, AuditFlow reaches 82.09% joint audit accuracy under GPT-5.5, outperforming the strongest baseline by 14.93 points. Removing deterministic checks drops accuracy to 17.91%, showing that the symbolic environment performs the verification step that the model cannot reliably replace.
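The abstract does not say which aggregation rule is used. As one possible reading, the sketch below fuses two auditor reports with Dempster's rule of combination over the hypotheses "consistent" and "inconsistent" plus an explicit uncertainty mass. The `Opinion` type, the function names, the example masses, and the choice to report one minus the remaining uncertainty as a trust score are all assumptions made for illustration.

```python
from typing import NamedTuple, Tuple


class Opinion(NamedTuple):
    """Hypothetical auditor report: three masses that sum to 1."""
    consistent: float
    inconsistent: float
    uncertain: float


def fuse(a: Opinion, b: Opinion) -> Opinion:
    """Combine two opinions with Dempster's rule (assumed, not from the paper)."""
    # Conflict: one report backs "consistent" while the other backs "inconsistent".
    conflict = a.consistent * b.inconsistent + a.inconsistent * b.consistent
    if conflict >= 1.0:
        raise ValueError("reports are in total conflict and cannot be fused")
    norm = 1.0 - conflict
    consistent = (
        a.consistent * b.consistent
        + a.consistent * b.uncertain
        + a.uncertain * b.consistent
    ) / norm
    inconsistent = (
        a.inconsistent * b.inconsistent
        + a.inconsistent * b.uncertain
        + a.uncertain * b.inconsistent
    ) / norm
    uncertain = (a.uncertain * b.uncertain) / norm
    return Opinion(consistent, inconsistent, uncertain)


def verdict(opinion: Opinion) -> Tuple[str, float]:
    """Return the better-supported label and a trust score (1 - uncertainty)."""
    if opinion.consistent >= opinion.inconsistent:
        label = "consistent"
    else:
        label = "inconsistent"
    return label, 1.0 - opinion.uncertain


# Made-up masses for the regulatory-view and evidentiary-view reports.
regulatory = Opinion(consistent=0.6, inconsistent=0.1, uncertain=0.3)
evidentiary = Opinion(consistent=0.7, inconsistent=0.1, uncertain=0.2)
fused = fuse(regulatory, evidentiary)
print(verdict(fused))
```

When the two reports agree, the fused uncertainty falls below either input, which is the behaviour one would want from a trust score. A high conflict value is a natural trigger for the senior auditor's request for further investigation, although that link is also an assumption here.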