QIAS 2026: Overview of the Shared Task on Islamic Inheritance Reasoning
Quick Answer
The QIAS 2026 shared task evaluates large language models' reasoning on Islamic inheritance, using the MAWARITH dataset of 12,500 annotated cases.
Quick Take
The QIAS 2026 shared task evaluates large language models' reasoning on Islamic inheritance, using the MAWARITH dataset of 12,500 annotated cases. Sixteen teams participated. The results reveal significant difficulty with legal interpretation and numerical reasoning: current models struggle with complex inheritance calculations.
Key Points
- QIAS 2026 is part of the OSACT7 Workshop at LREC 2026.
- The MAWARITH dataset includes 12,500 Arabic inheritance cases.
- Evaluation used MIR-E, measuring performance across inheritance reasoning stages.
- Sixteen teams explored various approaches, including prompting and fine-tuning.
- Current models struggle with precise legal interpretation and numerical reasoning.
Article Content
From source RSS / original summary: arXiv:2606.13756v1
Abstract: This paper presents a comprehensive overview of the QIAS 2026 shared task, organized as part of the OSACT7 Workshop and co-located with LREC 2026. The shared task was designed to evaluate the ability of large language models to perform complex reasoning in the religious and legal domain of Islamic inheritance.
Unlike conventional question-answering benchmarks, QIAS 2026 focuses on end-to-end reasoning from natural language cases, requiring systems to perform the full inheritance calculation process, from identifying the eligible heirs to assigning the correct share to each beneficiary. To support this evaluation, the task was based on the MAWARITH benchmark, a dataset of 12,500 Arabic inheritance cases annotated with intermediate reasoning steps and final answers.
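To make the end-to-end process concrete, here is a minimal Python sketch of one textbook case: a deceased husband survived by one wife, sons, and daughters, with no other heirs. It is an illustration only, not the MAWARITH format or any participating system. The function name `distribute` and the returned keys are hypothetical. The rules encoded are the standard ones for this simple case: the wife takes 1/8 when the deceased has children, and the children share the remainder, with each son receiving twice a daughter's portion.

```python
from fractions import Fraction


def distribute(sons: int, daughters: int) -> dict[str, Fraction]:
    """Shares for one wife plus children (hypothetical helper, simplified case).

    Assumes at least one son, exactly one wife, and no other heirs.
    """
    if sons < 1 or daughters < 0:
        raise ValueError("this sketch needs at least one son and no negative counts")

    # Step 1: fixed share. The wife receives 1/8 because the deceased has children.
    wife_share = Fraction(1, 8)

    # Step 2: the remainder goes to the children as residuaries.
    residue = 1 - wife_share

    # Step 3: split the residue in units, two per son and one per daughter.
    units = 2 * sons + daughters
    per_daughter = residue / units
    per_son = 2 * per_daughter

    return {"wife": wife_share, "each_son": per_son, "each_daughter": per_daughter}


shares = distribute(sons=1, daughters=1)
print(shares)
```

For one son and one daughter this gives the wife 1/8, the son 7/12, and the daughter 7/24, which sum to the whole estate. Real cases involve many more heir categories, blocking rules, and adjustments that this sketch does not cover.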
System submissions were evaluated using MIR-E, a multi-step metric that measures performance across the main stages of inheritance reasoning. A total of 16 teams participated in the shared task, investigating a range of approaches, including prompting-based methods and fine-tuning strategies.
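The summary does not spell out how MIR-E is computed, so the following is not MIR-E. It is a hypothetical sketch of the general idea of stage-wise scoring: compare a prediction to the reference at each reasoning stage and report a score per stage, not just a final-answer match. The stage names (`eligible_heirs`, `final_shares`), the exact-match rule, and the unweighted mean are all assumptions made for illustration.

```python
def stage_scores(pred: dict, gold: dict) -> dict[str, float]:
    """Exact-match score per stage (hypothetical scoring rule, not MIR-E)."""
    return {stage: 1.0 if pred.get(stage) == gold[stage] else 0.0 for stage in gold}


def mean_score(pred: dict, gold: dict) -> float:
    """Unweighted mean over stages (an assumed aggregation)."""
    scores = stage_scores(pred, gold)
    return sum(scores.values()) / len(scores)


# Hypothetical reference and prediction for a single case.
gold = {
    "eligible_heirs": {"wife", "son", "daughter"},
    "final_shares": {"wife": "1/8", "son": "7/12", "daughter": "7/24"},
}
pred = {
    "eligible_heirs": {"wife", "son", "daughter"},            # heirs identified correctly
    "final_shares": {"wife": "1/4", "son": "1/2", "daughter": "1/4"},  # shares wrong
}

print(stage_scores(pred, gold))
print(mean_score(pred, gold))
```

A per-stage breakdown of this kind shows where a system fails: here the heirs are right but the share assignment is not, which a single final-answer accuracy would hide.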
The results show that Islamic inheritance remains a highly challenging benchmark for current language models, especially in stages that require precise legal interpretation and structured numerical reasoning. This overview summarizes the task design, dataset, evaluation framework, participating systems, and main results.