Which Models Perform Better in Inheritance Reasoning?
Quick Answer
The study evaluates large language models in Arabic Islamic inheritance reasoning, revealing that commercial models outperform open-source ones.
Quick Take
Commercial models were stronger at identifying eligible heirs, applying exclusion rules, and staying consistent across reasoning steps. The best result came from Gemini 2.5 Flash, with an MRE of 0.989.
Key Points
- Commercial models show superior performance in structured legal reasoning tasks.
- Gemini 2.5 Flash achieved an MRE of 0.989, the highest in the study.
- Open-source models demonstrated instability, especially in complex legal scenarios.
- The evaluation highlights a significant reliability gap between model families.
- Solving the inheritance cases requires legal interpretation, multi-step reasoning, and precise numerical computation.
Article Excerpt
From the source RSS summary (arXiv:2606.13751v1). Abstract: This paper presents the participation of team PSL in the QIAS 2026 Shared Task on Arabic Islamic inheritance reasoning. The task evaluates the ability of large language models to solve inheritance cases that require legal interpretation, multi-step reasoning, and precise numerical computation.
We compare commercial and open-source models under a unified prompting strategy to assess their effectiveness in structured legal reasoning with minimal task-specific adaptation. Our results show a clear gap in reliability between the two model families. Commercial models demonstrate stronger performance in identifying eligible heirs, applying exclusion rules, and maintaining consistency across reasoning steps.
In contrast, open-source models exhibit greater instability, particularly in cases involving dependent legal decisions and fractional share adjustments. The best performance is achieved by Gemini 2.5 Flash, with an MRE of 0.989.
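The "fractional share adjustments" mentioned in the abstract can be illustrated with exact rational arithmetic. The sketch below is a minimal illustration, not the paper's method: `adjust_shares` is a hypothetical helper, and the input shares are example values chosen only to show the arithmetic. It assumes the adjustment is a proportional reduction applied when the fixed shares sum to more than the whole estate, and it does not model heir eligibility or exclusion rules.

```python
from fractions import Fraction


def adjust_shares(shares):
    """Scale shares down proportionally when they sum to more than 1.

    `shares` maps a heir label to a Fraction of the estate.
    Hypothetical helper for illustration; not taken from the paper.
    """
    total = sum(shares.values(), Fraction(0))
    if total <= 1:
        # Nothing to adjust: the shares already fit within the estate.
        return dict(shares)
    # Dividing every share by the total makes the adjusted shares sum to 1.
    return {heir: share / total for heir, share in shares.items()}


# Illustrative input (assumed values): the shares sum to 7/6, more than the estate.
example = {"husband": Fraction(1, 2), "two_sisters": Fraction(2, 3)}
adjusted = adjust_shares(example)

for heir, share in adjusted.items():
    print(heir, share)  # husband 3/7, then two_sisters 4/7
```

Using `Fraction` instead of floating-point numbers keeps every intermediate value exact, which avoids rounding error accumulating across the multi-step computation the task calls for.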