RASC+: Retrieval-Constrained LLM Adjudication for Clinical Value Set Authoring

arXiv cs.CL·Sumit Mukherjee

6/24/2026

Quick Answer

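The study pairs recall-optimized candidate retrieval with a retrieval-constrained LLM adjudicator for clinical value set authoring. With GPT-5 as the adjudicator, full-test macro F1 reaches 0.549, up from 0.287 for the original SAPBert cross-encoder on the same candidate pool.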

Quick Take

The approach is stage-wise. Qwen3-based retrieval first raises candidate-pool recall on the RASC benchmark from 0.553 to 0.730, and a GPT-5 adjudicator restricted to that pool then selects the final codes, reaching a full-test macro F1 of 0.549. The original SAPBert cross-encoder applied to the same pool reaches only 0.287, so a higher-recall pool alone is not sufficient: the gain depends on the selection stage as well as on retrieval.

Key Points

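The RASC benchmark showed that direct zero-shot LLM generation is poorly suited to clinical value set completion.
Candidate-pool recall improved from 0.553 to 0.730 with Qwen3-based retrieval (0.655 on the held-out-publisher stratum).
Full-test macro F1 rose to 0.549 with blinded GPT-5 adjudication over the expanded pool (0.533 on the held-out-publisher stratum).
The original SAPBert cross-encoder reached only 0.287 macro F1 on the same expanded pool.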
All returned codes must come from an auditable candidate pool.

Article Content

From source RSS / original summary

arXiv:2606.23992v1 (announce type: new). Abstract: Clinical value sets define the standardized terminology codes used in quality measurement, phenotyping, cohort construction, and clinical decision support. The recently introduced Retrieval-Augmented Set Completion (RASC) benchmark showed that direct zero-shot large language model (LLM) generation is poorly suited to this task: clinical code systems are large, version-controlled, and not reliably memorized by language models.

We study a stage-wise alternative in which candidate-pool construction is optimized for recall and a constrained LLM adjudicator is optimized for candidate selection. On the full 3,744-value-set RASC test split, Qwen3-based retrieval with vocabulary-aware expansion and code-display rescue retrieval increases candidate-pool recall from the original RASC retrieval baseline of 0.553 to 0.730; on the held-out-publisher stratum, pool recall is 0.655.
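To make the candidate-pool recall figure concrete, here is a minimal Python sketch. It assumes pool recall is the fraction of a value set's gold codes that appear in its retrieved pool, averaged over value sets; the abstract does not spell out the exact aggregation, and the names `pool_recall` and `mean_pool_recall` are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def pool_recall(gold: Set[str], pool: Iterable[str]) -> float:
    """Fraction of gold codes that appear in the candidate pool."""
    if not gold:
        return 0.0  # assumption: an empty gold set contributes zero
    return len(gold & set(pool)) / len(gold)


def mean_pool_recall(examples: List[Tuple[Set[str], List[str]]]) -> float:
    """Average per-value-set pool recall (assumed aggregation)."""
    if not examples:
        return 0.0
    return sum(pool_recall(gold, pool) for gold, pool in examples) / len(examples)
```

Pool recall bounds what any stage-2 selector can achieve: a gold code that never enters the pool cannot be returned when selection is constrained to the pool.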

The higher-recall pool alone is not sufficient: applying the original SAPBert cross-encoder to this expanded pool gives full-test macro F1 of 0.287 and held-out-publisher macro F1 of 0.233. Replacing the stage-2 selector with blinded GPT-5 adjudication over the same pool increases full-test macro F1 to 0.549 and held-out-publisher macro F1 to 0.533.
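The macro F1 numbers can be read as an average of per-value-set scores. The sketch below assumes that reading (set-level F1 for each value set, then an unweighted mean); the paper may define the metric differently, and `set_f1` and `macro_f1` are hypothetical names.

```python
from typing import List, Set, Tuple


def set_f1(gold: Set[str], predicted: Set[str]) -> float:
    """F1 between a gold code set and a predicted code set."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0  # also covers empty gold or empty prediction
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def macro_f1(pairs: List[Tuple[Set[str], Set[str]]]) -> float:
    """Unweighted mean of per-value-set F1 (assumed definition)."""
    if not pairs:
        return 0.0
    return sum(set_f1(gold, pred) for gold, pred in pairs) / len(pairs)
```

Under this definition a selector that keeps most of a large pool scores high recall but low precision, which is one way a higher-recall pool can still yield a low macro F1.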

These results show that retrieval-constrained LLM adjudication can substantially improve value set completion while preserving the safety constraint that all returned codes must come from an auditable candidate pool.
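The safety constraint can be enforced mechanically, whatever the selector returns. The following Python sketch is an illustration under stated assumptions, not the authors' implementation: `selector` is a hypothetical callable standing in for the LLM adjudicator, and the pool maps codes to display names purely as example data.

```python
from typing import Callable, Dict, List, Tuple


def constrained_select(
    pool: Dict[str, str],
    selector: Callable[[Dict[str, str]], List[str]],
) -> Tuple[List[str], List[str]]:
    """Run a selector over the pool and keep only codes that are in the pool.

    Returns (accepted, rejected); rejected codes are kept for auditing.
    """
    accepted: List[str] = []
    rejected: List[str] = []
    seen = set()
    for code in selector(pool):
        if code in seen:
            continue  # drop duplicate proposals
        seen.add(code)
        if code in pool:
            accepted.append(code)
        else:
            rejected.append(code)  # proposed code outside the candidate pool
    return accepted, rejected


# Hypothetical stand-in for an LLM call; it proposes one out-of-pool code.
def fake_selector(pool: Dict[str, str]) -> List[str]:
    return ["E11.9", "E11.9", "NOT-IN-POOL"]


example_pool = {
    "E11.9": "Type 2 diabetes mellitus without complications",
    "E10.9": "Type 1 diabetes mellitus without complications",
}
accepted, rejected = constrained_select(example_pool, fake_selector)
```

Filtering after the fact, rather than trusting the prompt alone, means every returned code traces back to a retrieved candidate, and any out-of-pool proposal is logged instead of silently returned.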
