MHGraphBench: Knowledge Graph-Grounded Benchmarking of Mental Health Knowledge in Large Language Models
Quick Take
MHGraphBench evaluates LLMs on mental health knowledge using a knowledge-graph-grounded benchmark.
Key Points
- Benchmark assesses entity recognition, relation judgment, and reasoning.
- Leading models excel in entity typing but struggle with relation prediction.
- Output format affects performance in multiple-choice evaluations.
Abstract: Large language models (LLMs) are increasingly used in the mental health domain, yet it remains unclear how well they capture related biomedical knowledge and how reliably they apply it to clinically salient structured judgments. Here, we present a knowledge-graph (KG)-grounded benchmark for assessing LLMs on mental-health entity recognition, relation judgment, and two-hop reasoning. The benchmark is derived from PrimeKG and comprises nine task families with KG-supported answers and controlled negative options. Experiments across 15 closed- and open-source LLMs reveal a persistent recognition-to-judgment gap: leading models achieve near-ceiling performance on entity typing and on the small relation-typing subset, yet they still struggle with relation prediction and two-hop reasoning. Additionally, short KG-derived snippets benefit some models but degrade performance for others. Moreover, output-format reliability can substantially influence measured performance under constrained multiple-choice settings, highlighting the critical role of response validity in benchmark-based evaluation. MHGraphBench should therefore be interpreted as evaluating agreement with a curated mental-health slice of PrimeKG under a constrained multiple-choice interface, rather than as a direct assessment of real-world clinical safety.
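The abstract's point about output-format reliability can be illustrated with a minimal sketch. It is not the benchmark's actual scoring code: the four-letter option format, the parsing rule, and the names `parse_choice` and `score` are all assumptions made for illustration. The sketch shows how the same set of responses yields different accuracy figures depending on whether unparseable answers are counted as wrong or excluded.

```python
import re
from typing import Dict, List, Optional


def parse_choice(response: str) -> Optional[str]:
    """Return the option letter if the response is a bare choice, else None.

    Assumed rule (hypothetical): only forms such as "B", "(B)", "b." or "B:"
    count as valid; free-text answers are treated as invalid.
    """
    text = response.strip().upper()
    match = re.fullmatch(r"\(?([A-D])\)?[.:]?", text)
    return match.group(1) if match else None


def score(responses: List[str], gold: List[str]) -> Dict[str, float]:
    """Compute validity rate plus two accuracy variants."""
    parsed = [parse_choice(r) for r in responses]
    valid = [(p, g) for p, g in zip(parsed, gold) if p is not None]
    correct = sum(1 for p, g in valid if p == g)
    total = len(gold)
    return {
        # Share of responses that could be parsed into an option letter.
        "validity_rate": len(valid) / total,
        # Invalid responses count as wrong.
        "strict_accuracy": correct / total,
        # Invalid responses are dropped from the denominator.
        "accuracy_on_valid": correct / len(valid) if valid else 0.0,
    }


# Toy data (invented for illustration, not taken from the benchmark).
responses = ["B", "(C)", "The answer is A because ...", "d.", ""]
gold = ["B", "C", "A", "D", "A"]
result = score(responses, gold)
print(result)
```

In this toy case, two of five responses fail to parse, so strict accuracy and accuracy on valid responses diverge even though every parsed answer is correct. Reporting the validity rate alongside accuracy makes that gap visible.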
Comments: Accepted to GEM 2026, ACL 2026 Workshop; 9 pages main text plus references and appendices
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2605.15589 [cs.CL] (or arXiv:2605.15589v1 [cs.CL] for this version)
DOI: https://doi.org/10.48550/arXiv.2605.15589 (arXiv-issued DOI via DataCite, pending registration)
Submission history
From: Weixin Liu
[v1] Fri, 15 May 2026 03:55:27 UTC (2,290 KB)
— Originally published at arxiv.org