CanLegalRAGBench: Evaluating Retrieval-Augmented Generation on Canadian Case Law

arXiv cs.CL·Ethan Zhao, Maksym Taranukhin, Wei Cui, Moira Aikenhead, Vered Shwartz


·~1 min·6/1/2026·en·0

Quick Take

CanLegalRAGBench introduces a Canadian legal QA benchmark focusing on realistic queries and expert-annotated answers, revealing that 8-29% of generated claims lack support from retrieved documents. The study highlights the sensitivity of retrieval performance to design choices and the competitiveness of open-source embedding models against closed-source ones.

Key Points

CanLegalRAGBench addresses the underrepresentation of Canadian law in legal evaluations.
Retrieval performance is sensitive to design choices in the retrieval setup.
Open-source embedding models perform competitively with closed-source models.
Generated answers often diverge from gold responses, either through hallucinations or through overly detailed or irrelevant content.
8-29% of claims made by systems are unsupported by retrieved documents.

Article Excerpt

From source RSS / original summary

arXiv:2605.30497v1. Abstract: RAG-based legal assistants have been growing in popularity, but LLM hallucinations remain a key issue and potentially undermine justice. While benchmarks have been developed to evaluate progress, many rely on synthetic queries rather than realistic legal scenarios. Moreover, Canadian law remains underrepresented in existing evaluations.

To address this gap, we introduce CanLegalRAGBench, a Canadian legal QA benchmark based on realistic queries and expert-annotated answers grounded in case law. Our evaluation shows that retrieval performance is sensitive to design choices and that open-source embedding models are competitive with closed-source models. However, it also reveals a limitation of automatic evaluations, which penalize systems for retrieving alternative relevant documents.
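The limitation noted above can be illustrated with a minimal sketch. It assumes a simple precision@k metric scored against a fixed gold set; the excerpt does not say which retrieval metrics the benchmark uses, and the case IDs, the `precision_at_k` helper, and the "alternative relevant" set are all hypothetical.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved IDs that appear in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / len(top_k)


# Hypothetical case IDs, made up for illustration only.
gold = {"case_A", "case_B"}            # documents cited in the annotated answer
alternative_relevant = {"case_C"}      # relevant to the query, but not annotated

retrieved = ["case_A", "case_C", "case_X"]

# Scoring against gold only treats case_C as a miss.
strict = precision_at_k(retrieved, gold, k=3)

# Scoring against gold plus known alternatives credits case_C.
lenient = precision_at_k(retrieved, gold | alternative_relevant, k=3)

print(f"strict precision@3:  {strict:.2f}")
print(f"lenient precision@3: {lenient:.2f}")
```

Under these assumptions the same ranking scores lower when only the annotated documents count as relevant, which is the kind of penalty the abstract describes.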

We also find that generated answers often diverge from gold responses, either with hallucinations or by producing overly detailed or irrelevant content, with 8-29% of claims not being supported by the retrieved documents. We hope this benchmark will help drive continued progress in addressing limitations of legal RAG systems.
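As a rough sketch of how a figure like "8-29% of claims not supported" could be computed, the snippet below assumes each generated claim has already been labeled as supported or unsupported by the retrieved documents. The excerpt does not describe how the authors extract or verify claims, so the `unsupported_claim_rate` helper and the labels are hypothetical.

```python
def unsupported_claim_rate(support_labels):
    """Share of claims whose label says the retrieved documents do not support them."""
    if not support_labels:
        raise ValueError("need at least one labeled claim")
    unsupported = sum(1 for supported in support_labels if not supported)
    return unsupported / len(support_labels)


# Hypothetical labels for five claims in one generated answer:
# True = supported by a retrieved document, False = unsupported.
labels = [True, True, False, True, True]

rate = unsupported_claim_rate(labels)
print(f"unsupported claims: {rate:.0%}")
```

The labeling step itself (by an expert or an automatic verifier) is the hard part and is outside this sketch.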

