RASC+: Retrieval-Constrained LLM Adjudication for Clinical Value Set Authoring
Quick Answer
The study introduces a retrieval-constrained LLM adjudication method for clinical value set authoring that reaches a full-test macro F1 of 0.549 with GPT-5 as the adjudicator.
Quick Take
The method works in two stages: retrieval builds a recall-optimized candidate pool, and a constrained LLM adjudicator then selects codes from that pool. On the RASC benchmark, Qwen3-based retrieval raises candidate-pool recall from 0.553 to 0.730, and GPT-5 adjudication over that pool reaches a macro F1 of 0.549, compared with 0.287 for the original SAPBert cross-encoder. The results indicate that a higher-recall pool helps only when it is paired with a strong selector.
Key Points
- The RASC benchmark showed that direct zero-shot LLM generation is poorly suited to clinical code generation.
- Candidate-pool recall improved from 0.553 to 0.730 with Qwen3-based retrieval on the full test split.
- Full-test macro F1 rose to 0.549 when GPT-5 adjudicated candidates from the expanded pool.
- The original SAPBert cross-encoder reached only 0.287 macro F1 on the same expanded pool.
- All returned codes must come from an auditable candidate pool.
Article Content
arXiv:2606.23992v1
Abstract: Clinical value sets define the standardized terminology codes used in quality measurement, phenotyping, cohort construction, and clinical decision support. The recently introduced Retrieval-Augmented Set Completion (RASC) benchmark showed that direct zero-shot large language model (LLM) generation is poorly suited to this task: clinical code systems are large, version-controlled, and not reliably memorized by language models.
We study a stage-wise alternative in which candidate-pool construction is optimized for recall and a constrained LLM adjudicator is optimized for candidate selection. On the full 3,744-value-set RASC test split, Qwen3-based retrieval with vocabulary-aware expansion and code-display rescue retrieval increases candidate-pool recall from the original RASC retrieval baseline of 0.553 to 0.730; on the held-out-publisher stratum, pool recall is 0.655.
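To make the candidate-pool recall figure concrete, here is a minimal sketch of how such a number could be computed. It assumes recall is the fraction of a value set's gold codes that appear in its candidate pool, averaged over value sets; the abstract does not spell out the exact aggregation, and the function name, identifiers, and toy data are hypothetical.

```python
from typing import Dict, Set


def pool_recall(gold: Dict[str, Set[str]], pools: Dict[str, Set[str]]) -> float:
    """Mean, over value sets, of the share of gold codes found in the pool.

    Assumption: per-value-set averaging. The paper may aggregate differently.
    """
    scores = []
    for value_set_id, gold_codes in gold.items():
        if not gold_codes:
            continue  # assumption: value sets with no gold codes are skipped
        pool = pools.get(value_set_id, set())
        scores.append(len(gold_codes & pool) / len(gold_codes))
    return sum(scores) / len(scores) if scores else 0.0


# Toy data, not taken from the RASC benchmark.
gold = {"vs1": {"A", "B", "C", "D"}, "vs2": {"X", "Y"}}
pools = {"vs1": {"A", "B", "Z"}, "vs2": {"X", "Y", "W"}}

recall = pool_recall(gold, pools)
print(f"pool recall: {recall:.3f}")
```

Pool recall bounds the recall of any selector that is restricted to the pool, which is why the first stage is tuned for recall rather than precision.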
The higher-recall pool alone is not sufficient: applying the original SAPBert cross-encoder to this expanded pool gives full-test macro F1 of 0.287 and held-out-publisher macro F1 of 0.233. Replacing the stage-2 selector with blinded GPT-5 adjudication over the same pool increases full-test macro F1 to 0.549 and held-out-publisher macro F1 to 0.533.
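The macro F1 comparison can be illustrated with a short sketch. It assumes macro F1 means a set-level F1 computed per value set and then averaged with equal weight; that reading is an assumption, and the helper names and toy data are hypothetical.

```python
from typing import Dict, Set


def set_f1(predicted: Set[str], gold: Set[str]) -> float:
    """F1 between a predicted code set and the gold code set."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0  # also covers empty predictions and empty gold sets
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def macro_f1(predictions: Dict[str, Set[str]], gold: Dict[str, Set[str]]) -> float:
    """Unweighted mean of per-value-set F1 (assumed definition of macro F1)."""
    scores = [set_f1(predictions.get(vs_id, set()), codes) for vs_id, codes in gold.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Toy data, not taken from the RASC benchmark.
gold = {"vs1": {"A", "B", "C", "D"}, "vs2": {"X", "Y"}}
predictions = {"vs1": {"A", "B", "Z"}, "vs2": {"X", "Y"}}

score = macro_f1(predictions, gold)
print(f"macro F1: {score:.3f}")
```

Because F1 penalizes false positives, a selector that keeps too many of the pool's extra candidates scores poorly even when pool recall is high, which is consistent with the gap between the two selectors reported above.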
These results show that retrieval-constrained LLM adjudication can substantially improve value set completion while preserving the safety constraint that all returned codes must come from an auditable candidate pool.
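One way to enforce the constraint that every returned code comes from the candidate pool is to filter the adjudicator's output against the pool before returning it. The sketch below is illustrative only: `constrained_select` and `stub_selector` are hypothetical names, the stub stands in for an LLM call, and the code strings are placeholder examples rather than data from the paper.

```python
from typing import Callable, Iterable, List


def constrained_select(
    pool: List[str],
    selector: Callable[[List[str]], Iterable[str]],
) -> List[str]:
    """Return the selector's choices, keeping only codes present in the pool.

    Out-of-pool codes and duplicates are dropped, so the output is always
    an auditable subset of the candidate pool.
    """
    allowed = set(pool)
    seen = set()
    kept = []
    for code in selector(pool):
        if code in allowed and code not in seen:
            kept.append(code)
            seen.add(code)
    return kept


def stub_selector(candidates: List[str]) -> List[str]:
    """Hypothetical stand-in for an LLM adjudicator; emits one invalid code."""
    return [candidates[0], "NOT-IN-POOL", candidates[0]]


pool = ["E11.9", "E11.65", "I10"]  # placeholder candidate codes
selected = constrained_select(pool, stub_selector)
print(selected)
```

Filtering after the fact is only one possible design; an adjudicator could instead be asked to answer with indices into the pool, which makes out-of-pool output impossible by construction.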