CanLegalRAGBench: Evaluating Retrieval-Augmented Generation on Canadian Case Law
Quick Take
CanLegalRAGBench introduces a Canadian legal QA benchmark focusing on realistic queries and expert-annotated answers, revealing that 8-29% of generated claims lack support from retrieved documents. The study highlights the sensitivity of retrieval performance to design choices and the competitiveness of open-source embedding models against closed-source ones.
Key Points
- CanLegalRAGBench addresses the underrepresentation of Canadian law in legal evaluations.
- Retrieval performance is sensitive to design choices.
- Open-source embedding models perform competitively with closed-source models.
- Generated answers often diverge from gold responses, either through hallucinations or through overly detailed or irrelevant content.
- 8-29% of claims made by systems are unsupported by retrieved documents.
Article Excerpt
From source RSS / original summary: arXiv:2605.30497v1. Announce Type: new. Abstract: RAG-based legal assistants have been growing in popularity, but LLM hallucinations remain a key issue and potentially undermine justice. While benchmarks have been developed to evaluate progress, many rely on synthetic queries rather than realistic legal scenarios. Moreover, Canadian law remains underrepresented in existing evaluations.
To address this gap, we introduce CanLegalRAGBench, a Canadian legal QA benchmark based on realistic queries and expert-annotated answers grounded in case law. Our evaluation shows that retrieval performance is sensitive to design choices and that open-source embedding models are competitive with closed-source models. However, it also reveals a limitation of automatic evaluations, which penalize systems for retrieving alternative relevant documents.
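The limitation of automatic retrieval evaluation mentioned above can be illustrated with a small sketch. It is not the paper's evaluation code: the metric (a simple hit@k), the function name `hit_at_k`, and the case identifiers are all assumptions chosen for illustration. It shows how a system that retrieves an alternative relevant case scores zero when only the annotated gold case counts as relevant.

```python
def hit_at_k(retrieved, relevant, k):
    """Return 1.0 if any of the top-k retrieved documents is in `relevant`, else 0.0."""
    top_k = retrieved[:k]
    return 1.0 if any(doc in relevant for doc in top_k) else 0.0


# Hypothetical case identifiers, for illustration only.
retrieved = ["case_B", "case_C", "case_D"]

# Strict scoring: only the single annotated gold case counts.
gold_only = {"case_A"}

# Lenient scoring: assume an expert would also accept case_B as relevant.
gold_plus_alternative = {"case_A", "case_B"}

strict = hit_at_k(retrieved, gold_only, k=3)
lenient = hit_at_k(retrieved, gold_plus_alternative, k=3)

print(strict)   # 0.0: the system is penalized despite retrieving a usable case
print(lenient)  # 1.0: the same ranking counts as a hit once alternatives are accepted
```

The gap between the two scores comes entirely from how relevance is annotated, not from any change in the system's ranking.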
We also find that generated answers often diverge from gold responses, either with hallucinations or by producing overly detailed or irrelevant content, with 8-29% of claims not being supported by the retrieved documents. We hope this benchmark will help drive continued progress in addressing limitations of legal RAG systems.
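As a rough illustration of how a figure like "8-29% of claims not being supported" could be computed, the sketch below takes claims that have already been labelled as supported or unsupported and reports the unsupported share. The `Claim` type, the `unsupported_rate` function, and the example labels are hypothetical; the abstract does not describe how claims were extracted or judged.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str
    supported: bool  # hypothetical label: is the claim backed by a retrieved document?


def unsupported_rate(claims):
    """Fraction of claims that are not supported by any retrieved document."""
    if not claims:
        return 0.0
    unsupported = sum(1 for claim in claims if not claim.supported)
    return unsupported / len(claims)


# Placeholder claims with made-up labels, for illustration only.
claims = [
    Claim("claim 1", supported=True),
    Claim("claim 2", supported=True),
    Claim("claim 3", supported=False),
    Claim("claim 4", supported=True),
]

rate = unsupported_rate(claims)
print(f"{rate:.0%} of claims unsupported")  # 25% of claims unsupported
```

In practice the hard part is producing the `supported` labels, whether by expert annotation or an automated judge; the arithmetic itself is a simple ratio.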