Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling
Quick Take
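A systematic human audit of the FOLIO validation split and a subset of MALLS test instances finds that approximately 39% and 36% of entries, respectively, contain incorrect FOL formalizations. Evaluating against corrected ground truths raises measured LLM accuracy by 9 to 22 percentage points, and an LLM-assisted framework makes it possible to reach 90% dataset accuracy after manually reviewing fewer than 24% of instances.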
Key Points
- Approximately 39% of audited FOLIO entries and 36% of audited MALLS entries have incorrect FOL formalizations.
- Ambiguous NL sentences occur at rates of 16.4% (FOLIO) and 48% (MALLS); 8.4% of FOLIO NLI labels are incorrect.
- Evaluating three LLMs against the corrected ground truths raised measured accuracy by 9 to 22 percentage points.
- The LLM-based framework reaches 90% dataset accuracy after reviewing fewer than 24% of instances, compared to over 70% for unguided review.
- All human-verified annotations and framework code are publicly released.
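To make the evaluation-distortion point concrete, the following minimal sketch shows how the same model outputs can score differently against original and corrected ground truths. Everything here is illustrative: the toy formulas, the `accuracy` helper, and the exact-string comparison are assumptions, since the summary does not say how the paper checks whether a predicted formula matches the reference.

```python
def accuracy(predictions, gold):
    """Fraction of predictions that match the gold label.

    Assumption: exact string match stands in for whatever equivalence
    check the benchmark task actually uses.
    """
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    matches = sum(p == g for p, g in zip(predictions, gold))
    return matches / len(gold)


# Hypothetical toy data: four model outputs and two versions of the labels.
predictions = [
    "forall x (Dog(x) -> Animal(x))",
    "exists x (Cat(x) & Black(x))",
    "forall x (Bird(x) -> Flies(x))",
    "exists x (Fish(x) & Swims(x))",
]
original_gold = [
    "forall x (Dog(x) -> Animal(x))",
    "forall x (Cat(x) -> Black(x))",   # hypothetical annotation error
    "forall x (Bird(x) -> Flies(x))",
    "forall x (Fish(x) -> Swims(x))",  # the model is wrong on this one
]
corrected_gold = list(original_gold)
corrected_gold[1] = "exists x (Cat(x) & Black(x))"  # label fixed by a reviewer

acc_original = accuracy(predictions, original_gold)
acc_corrected = accuracy(predictions, corrected_gold)
gain_pp = (acc_corrected - acc_original) * 100  # gain in percentage points

print(f"original: {acc_original:.2f}, corrected: {acc_corrected:.2f}, gain: {gain_pp:+.0f} pp")
```

In this toy case one corrected label turns a counted miss into a hit, which is the mechanism behind the reported gains; the numbers themselves are made up and do not reproduce the paper's results.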
Article Content
From source RSS / original summary
arXiv:2606.02837v1 Announce Type: new
Abstract: Accurate translation from Natural Language to First-Order Logic (NL-to-FOL) underpins neurosymbolic AI systems and Natural Language Inference (NLI), making the quality of NL-to-FOL benchmarks essential, yet these datasets have never been rigorously audited.
Our first contribution is to present a systematic human inspection of the validation split of FOLIO and a subset of MALLS test instances, finding that approximately 39% and 36% of entries, respectively, contain incorrect FOL formalizations (i.e., ground truth labels), with additional rates of ambiguous NL sentences (16.4% and 48%) and incorrect NLI labels in FOLIO (8.4%).
Our second contribution is to develop and release corrected ground truths for such datasets, showing that annotation errors distort model evaluation on a reference benchmark task: testing three state-of-the-art LLMs (Gemma 4 31B-it, Qwen3-30B-A3B, and GPT-4o-mini) with the corrected ground truths yields accuracy gains from +9 to +22 percentage points. Motivated by these findings, we propose an LLM-based framework to support humans in manually reviewing NL-to-FOL datasets.
By directing reviewers toward the most error-prone instances, we empirically show that it is possible to achieve 90% dataset accuracy after reviewing fewer than 24% of instances, compared to over 70% required by unguided review. We release all human-verified annotations and the code for our framework.
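The idea of guided review can be sketched as follows. This is not the authors' framework: the suspicion scores are hypothetical stand-ins for whatever signal an LLM would provide, the helper `reviewed_fraction_for_target` is an illustrative name, and the sketch assumes that a human review always fixes an incorrect label. It only shows why ranking instances by estimated error likelihood can reach a target dataset accuracy with less review than going through instances in their original order.

```python
def reviewed_fraction_for_target(is_correct, order, target=0.90):
    """Fraction of instances reviewed, following `order`, until the share of
    correct labels reaches `target`.

    Assumption: reviewing an instance with a wrong label always fixes it.
    """
    n = len(is_correct)
    correct = sum(is_correct)
    if correct / n >= target:
        return 0.0
    for reviewed, idx in enumerate(order, start=1):
        if not is_correct[idx]:
            correct += 1  # the reviewer repairs this label
        if correct / n >= target:
            return reviewed / n
    return 1.0  # target not reached even after following the whole order


# Hypothetical toy dataset: 10 instances, 4 with incorrect labels.
is_correct = [True, False, True, True, False, True, False, True, True, False]
# Hypothetical LLM suspicion scores (higher means more likely to be wrong).
scores = [0.1, 0.9, 0.2, 0.7, 0.8, 0.1, 0.6, 0.3, 0.2, 0.4]

# Guided review: most suspicious instances first.
guided_order = sorted(range(len(scores)), key=lambda i: -scores[i])
# Unguided review: instances in their original order.
unguided_order = list(range(len(scores)))

guided = reviewed_fraction_for_target(is_correct, guided_order)
unguided = reviewed_fraction_for_target(is_correct, unguided_order)

print(f"guided: {guided:.0%} reviewed, unguided: {unguided:.0%} reviewed")
```

The toy scores are imperfect on purpose (one correct instance is ranked above a wrong one), so the guided order still spends some effort on labels that were already right. The figures it prints come from made-up data and are unrelated to the 24% and 70% reported in the abstract.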