Can Post-Training Turn LLMs into Good Medical Coders? An Empirical Study of Generative ICD Coding

arXiv cs.CL·Ziqing Wang, Weihao Li, Shijie Chen, Yuan Luo, Kaize Ding

6/15/2026

Quick Take

This study evaluates post-training methods for generative LLMs on ICD coding. Supervised fine-tuning (SFT) provides the main capability jump, and reinforcement learning (RL) with GRPO further improves code-set prediction beyond SFT. The authors also introduce PHI, a diagnostic curriculum that extends GRPO to refine missed-code cases. The results suggest that the main bottleneck is not the generative formulation alone, but how the model is adapted and optimized for full-taxonomy recall.

Key Points

Post-training methods substantially improve generative LLMs' ICD coding performance.
Supervised fine-tuning (SFT) is the primary driver of performance improvement.
Reinforcement learning (RL) further refines code-set predictions beyond SFT.
The PHI curriculum extends GRPO to refine missed-code cases, with targeted gains on macro-level performance.
Prompting-only evaluations underestimate LLMs' potential in medical coding.

Article Content

From source RSS / original summary

arXiv:2606.13940v1 Announce Type: new

Abstract: Automated International Classification of Diseases (ICD) coding is a core medical-coding task for billing, epidemiology, and clinical decision support. Generative large language models (LLMs) are often reported as weak medical coders, but this finding mainly comes from inference-time settings such as prompting, retrieval, or reranking, leaving the role of task-specific post-training underexplored.

We present a controlled empirical study of post-training for generative ICD coding, comparing discriminative baselines with LLM coders across prompting, supervised fine-tuning, and reinforcement learning under a common protocol and metric set. To our knowledge, this is the first study to evaluate RL-based post-training for generative LLM coders in ICD coding. We further introduce PHI, a diagnostic curriculum that extends GRPO to refine missed-code cases.
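As a rough illustration of the RL stage, the sketch below shows the group-relative advantage computation that GRPO is generally built around: several outputs are sampled for the same clinical note, each is scored, and each score is normalized against its group. The set-level F1 reward, the function names, and the example codes are assumptions made for illustration; the abstract does not specify the reward the authors use or how PHI modifies it.

```python
from statistics import mean, pstdev


def set_f1(predicted: set[str], gold: set[str]) -> float:
    """F1 between a predicted and a gold code set (an assumed reward, not the paper's)."""
    if not predicted and not gold:
        return 1.0
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward by the mean and standard deviation of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical gold codes for one note and four sampled model outputs.
gold = {"I10", "E11.9", "N18.3"}
samples = [
    {"I10", "E11.9", "N18.3"},  # exact match
    {"I10", "E11.9"},           # misses one code
    {"I10", "J45.909"},         # one hit, one spurious code
    {"Z00.00"},                 # no overlap
]

rewards = [set_f1(sample, gold) for sample in samples]
advantages = group_relative_advantages(rewards)

for sample, reward, advantage in zip(samples, rewards, advantages):
    print(sorted(sample), round(reward, 3), round(advantage, 3))
```

Outputs that score above the group mean receive positive advantages and are reinforced, while those below the mean are pushed down. No learned value model is needed, which is the usual motivation for this formulation.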

Our results show that prompting-only evaluation substantially underestimates the potential of LLMs for ICD coding. SFT provides the main capability jump, GRPO further improves code-set prediction beyond SFT, and PHI provides targeted gains on macro-level performance. These findings suggest that the main bottleneck is not the generative formulation alone, but how the model is adapted and optimized for full-taxonomy recall. We release our code, data splits, and checkpoints at https://github.com/AlexandreWANG915/LLM4ICD.
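To make the distinction between overall code-set prediction and macro-level performance concrete, the following minimal sketch computes micro- and macro-averaged F1 over predicted code sets. It assumes the standard definitions of these averages; the abstract does not list the exact metric set, and the helper name and toy data are hypothetical. Macro averaging here covers only codes that appear in the toy data, whereas a real evaluation would typically average over the full label set.

```python
from collections import Counter


def micro_macro_f1(gold: list[set[str]], pred: list[set[str]]) -> tuple[float, float]:
    """Return (micro F1, macro F1) for multi-label code-set predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold_codes, pred_codes in zip(gold, pred):
        for code in gold_codes & pred_codes:
            tp[code] += 1
        for code in pred_codes - gold_codes:
            fp[code] += 1
        for code in gold_codes - pred_codes:
            fn[code] += 1

    labels = set(tp) | set(fp) | set(fn)
    if not labels:
        return 0.0, 0.0

    def f1(t: int, f_pos: int, f_neg: int) -> float:
        denominator = 2 * t + f_pos + f_neg
        return 2 * t / denominator if denominator else 0.0

    # Micro: pool counts across all codes, so frequent codes dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: average per-code F1, so every code counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Toy data: "I10" is frequent, "E11.9" and "N18.3" each appear once.
gold = [{"I10", "E11.9"}, {"I10"}, {"I10", "N18.3"}, {"I10"}]
pred = [{"I10", "E11.9"}, {"I10"}, {"I10"}, {"I10"}]

micro, macro = micro_macro_f1(gold, pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

In this toy case a single missed rare code barely moves micro F1 but removes a third of the macro score. That sensitivity is why macro-level gains are often read as a signal of better coverage of rarely seen codes.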
