Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling

arXiv cs.CL·Andrea Brunello, Cristian Curaba, Luca Geatti, Michele Mignani, Angelo Montanari, Nicola Saccomanno

6/3/2026

Quick Take

A systematic human audit of the FOLIO validation split and a subset of the MALLS test set finds that roughly 39% and 36% of entries, respectively, have incorrect FOL formalizations. Evaluating against corrected ground truths raises measured LLM accuracy by 9 to 22 percentage points, and an LLM-assisted framework reaches 90% dataset accuracy after manual review of fewer than 24% of instances.

Key Points

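About 39% of audited FOLIO entries (validation split) and 36% of audited MALLS entries (a subset of the test set) have incorrect FOL formalizations.
In FOLIO, 16.4% of NL sentences are ambiguous and 8.4% of NLI labels are incorrect; in MALLS, 48% of NL sentences are ambiguous.
With corrected ground truths, the measured accuracy of three LLMs (Gemma 4 31B-it, Qwen3-30B-A3B, and GPT-4o-mini) rises by 9 to 22 percentage points.
The LLM-based framework reaches 90% dataset accuracy after reviewing fewer than 24% of instances, versus over 70% for unguided review.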
All human-verified annotations and framework code are publicly released.

Article Content

From source RSS / original summary

arXiv:2606.02837v1. Abstract: Accurate translation from Natural Language to First-Order Logic (NL-to-FOL) underpins neurosymbolic AI systems and Natural Language Inference (NLI), making the quality of NL-to-FOL benchmarks essential, yet these datasets have never been rigorously audited.

Our first contribution is to present a systematic human inspection of the validation split of FOLIO and a subset of MALLS test instances, finding that approximately 39% and 36% of entries, respectively, contain incorrect FOL formalizations (i.e., ground truth labels). We also find ambiguous NL sentences (16.4% and 48%, respectively) and incorrect NLI labels in FOLIO (8.4%).

Our second contribution is to develop and release corrected ground truths for such datasets, showing that annotation errors distort model evaluation on a reference benchmark task: testing three state-of-the-art LLMs (Gemma 4 31B-it, Qwen3-30B-A3B, and GPT-4o-mini) with the corrected ground truths yields accuracy gains from +9 to +22 percentage points. Motivated by these findings, we propose an LLM-based framework to support humans in manually reviewing NL-to-FOL datasets.

By directing reviewers toward the most error-prone instances, we empirically show that it is possible to achieve 90% dataset accuracy after reviewing fewer than 24% of instances, compared to over 70% required by unguided review. We release all human-verified annotations and the code for our framework.
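To make the guided-versus-unguided comparison concrete, here is a minimal Python sketch of prioritized review. It is not the authors' framework: the abstract does not describe how instances are scored, so the `risk` scores, the function name `review_fraction_for_target`, and the assumption that a human review always fixes an incorrect entry are all hypothetical, and the toy data is invented for illustration.

```python
from typing import List


def review_fraction_for_target(
    is_correct: List[bool],
    risk: List[float],
    target: float = 0.90,
) -> float:
    """Fraction of instances to review, highest risk first, until the
    share of correct instances reaches `target`.

    Assumption (hypothetical): reviewing an instance always fixes it
    if it was wrong and leaves it unchanged if it was right.
    """
    n = len(is_correct)
    if n == 0 or len(risk) != n:
        raise ValueError("is_correct and risk must be non-empty and equal length")

    correct = sum(is_correct)
    if correct / n >= target:
        return 0.0

    # Review order: most suspicious instances first. Python's sort is
    # stable, so ties keep their original order.
    order = sorted(range(n), key=lambda i: risk[i], reverse=True)
    for reviewed, i in enumerate(order, start=1):
        if not is_correct[i]:
            correct += 1  # the reviewer repairs this entry
        if correct / n >= target:
            return reviewed / n
    return 1.0


if __name__ == "__main__":
    # Toy data: 10 entries, 4 of them wrong (indices 1, 4, 6, 9).
    labels = [True, False, True, True, False, True, False, True, True, False]
    guided = [0.1, 0.9, 0.2, 0.1, 0.8, 0.3, 0.7, 0.2, 0.1, 0.6]
    unguided = [0.0] * 10  # no signal: review in dataset order
    print(review_fraction_for_target(labels, guided))
    print(review_fraction_for_target(labels, unguided))
```

The better the risk scores separate wrong entries from right ones, the earlier the target is reached; with uninformative scores the procedure degrades to reviewing in dataset order.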
