Primary ICD Category Prediction using LLM-based Probing

arXiv cs.AI·Chengyuan Liu, Xinyue Zhang, Yao Li, Guanting Chen

Quick Take

This study finds that embeddings from a frozen MedFound-Llama3-8B-finetuned model can serve as a shared space for structured and unstructured EHR data in primary diagnosis category prediction, reaching 87.69% strict and 91.45% medical accuracy on MIMIC-IV. The combined linear probe outperformed both single-modality probes and the XGBoost and PLM-ICD baselines, which suggests that clinical representations can be reused efficiently across modalities.

Key Points

Combined probing reached 87.69% strict and 91.45% medical accuracy on the MIMIC-IV cohort.
The structured-only probe beat its standard baseline by 6.19 points in medical accuracy.
Diagnostic information became increasingly linearly separable in deeper transformer layers.
A 2M-parameter adapter restored cross-dataset transfer to MIMIC-III using only 5% of target labels.
The results support efficient reuse of clinical representations across modalities and datasets.
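
For readers unfamiliar with probing, the idea behind the points above can be sketched in a few lines: a linear classifier is trained on fixed feature vectors while the model that produced them stays untouched. The sketch below is a minimal illustration only. It uses synthetic 2-D points as stand-ins for hidden states, and the function names, learning rate, and epoch count are assumptions rather than details taken from the paper.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores(weights, biases, x):
    """Linear scores w_c . x + b_c for every class c."""
    return [sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
            for w, b in zip(weights, biases)]


def train_linear_probe(features, labels, num_classes, lr=0.1, epochs=200):
    """Fit a softmax linear classifier with full-batch gradient descent.

    Only the probe's weights and biases are updated; the features are
    treated as fixed inputs, mirroring a frozen backbone.
    """
    dim = len(features[0])
    n = len(features)
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    for _ in range(epochs):
        grad_w = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        for x, y in zip(features, labels):
            probs = softmax(class_scores(weights, biases, x))
            for c in range(num_classes):
                # Gradient of cross-entropy with respect to the class score.
                err = probs[c] - (1.0 if c == y else 0.0)
                grad_b[c] += err
                for j in range(dim):
                    grad_w[c][j] += err * x[j]
        for c in range(num_classes):
            biases[c] -= lr * grad_b[c] / n
            for j in range(dim):
                weights[c][j] -= lr * grad_w[c][j] / n
    return weights, biases


def predict(weights, biases, x):
    """Return the index of the highest-scoring class."""
    scores = class_scores(weights, biases, x)
    return max(range(len(scores)), key=lambda c: scores[c])


def accuracy(weights, biases, features, labels):
    """Fraction of examples whose predicted class matches the label."""
    hits = sum(predict(weights, biases, x) == y for x, y in zip(features, labels))
    return hits / len(labels)


def make_toy_embeddings(seed=0, per_class=30):
    """Synthetic 2-D clusters standing in for frozen hidden states (not real data)."""
    rng = random.Random(seed)
    centers = [(3.0, 0.0), (-3.0, 0.0), (0.0, 4.0)]
    features, labels = [], []
    for label, (cx, cy) in enumerate(centers):
        for _ in range(per_class):
            features.append([rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)])
            labels.append(label)
    return features, labels


if __name__ == "__main__":
    feats, labs = make_toy_embeddings()
    w, b = train_linear_probe(feats, labs, num_classes=3)
    print(f"toy probe accuracy: {accuracy(w, b, feats, labs):.2f}")
```

In a layer-wise study, the same probe would be fitted once per extracted layer and the accuracies compared; higher probe accuracy at a given layer is read as the labels being more linearly separable there.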

[Submitted on 27 Jun 2026]

Abstract: Objective: ICD codes are central to reimbursement, research, and population health surveillance, yet automated coding systems often struggle to integrate diagnostic signals from both clinical narratives and structured electronic health record (EHR) variables. We evaluated whether frozen medical large language model (LLM) representations can serve as a shared embedding space for multimodal primary diagnosis category prediction.
Materials and Methods: We constructed a MIMIC-IV cohort of 13,645 admissions from the 10 most frequent primary ICD-10 codes, consolidated into seven categories. Structured variables were serialized into clinical narratives and combined with leakage-pruned discharge notes. Using a frozen MedFound-Llama3-8B-finetuned backbone, we extracted hidden states from five transformer layers and trained linear probes for structured-only, unstructured-only, and combined inputs, comparing against XGBoost and information-matched PLM-ICD baselines and evaluating MIMIC-III adaptation with a compact bottleneck adapter.
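
As an illustration of the serialization step described in the Materials and Methods paragraph, structured variables can be rendered as short sentences and concatenated with note text before being passed to a language model. The sketch below is an assumption-laden example: the field names (`age`, `sex`, `admission_type`, `labs`), the sentence templates, and both function names are hypothetical and are not taken from the paper.

```python
def serialize_structured(record):
    """Render a dict of structured EHR variables as a short narrative.

    All keys used here are hypothetical; a real cohort would define its
    own variables and templates.
    """
    sentences = []
    if "age" in record and "sex" in record:
        sentences.append(
            f"The patient is a {record['age']}-year-old {record['sex']}."
        )
    if "admission_type" in record:
        sentences.append(f"Admission type: {record['admission_type']}.")
    labs = record.get("labs", {})
    if labs:
        # Each lab is stored as name -> (value, unit).
        lab_text = ", ".join(
            f"{name} {value} {unit}" for name, (value, unit) in labs.items()
        )
        sentences.append(f"Laboratory results: {lab_text}.")
    return " ".join(sentences)


def combine_inputs(structured_text, note_text):
    """Join the serialized variables and the note into one model input."""
    return f"{structured_text}\n\n{note_text}".strip()


if __name__ == "__main__":
    example = {
        "age": 67,
        "sex": "female",
        "admission_type": "emergency",
        "labs": {"creatinine": (1.8, "mg/dL")},
    }
    print(combine_inputs(serialize_structured(example), "Presented with dyspnea."))
```

Serializing to text lets one encoder handle both modalities, so the structured-only, unstructured-only, and combined settings differ only in which string is fed to the frozen model.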
Results: The combined probe performed best on MIMIC-IV (87.69% strict; 91.45% medical accuracy), exceeding both single-modality probes and baselines. The structured-only probe outperformed its standard baseline by 6.19 points in medical accuracy. Diagnostic information became increasingly linearly separable in deeper layers, and a 2M-parameter adapter restored cross-dataset transfer to MIMIC-III using only 5% of target labels.
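
To give a sense of scale for the 2M-parameter adapter, the arithmetic below counts the weights in a generic bottleneck adapter (a down-projection followed by an up-projection). This is a sketch under stated assumptions: the hidden size of 4096 assumes the backbone keeps the Llama 3 8B width, and the bottleneck width of 256 is an illustrative value chosen because it lands near 2M; the abstract does not state the adapter's actual dimensions or design.

```python
def bottleneck_adapter_params(hidden_size, bottleneck_size, use_bias=True):
    """Parameter count of a down-projection plus up-projection adapter."""
    down = hidden_size * bottleneck_size   # hidden -> bottleneck weights
    up = bottleneck_size * hidden_size     # bottleneck -> hidden weights
    bias = (bottleneck_size + hidden_size) if use_bias else 0
    return down + up + bias


def linear_probe_params(hidden_size, num_classes):
    """Parameter count of a linear probe with one bias per class."""
    return hidden_size * num_classes + num_classes


HIDDEN_SIZE = 4096      # assumed: Llama 3 8B hidden width
BOTTLENECK_SIZE = 256   # assumed: illustrative value, not from the paper
NUM_CLASSES = 7         # seven diagnosis categories, per the abstract

if __name__ == "__main__":
    adapter = bottleneck_adapter_params(HIDDEN_SIZE, BOTTLENECK_SIZE)
    probe = linear_probe_params(HIDDEN_SIZE, NUM_CLASSES)
    print(f"adapter parameters: {adapter:,}")
    print(f"probe parameters:   {probe:,}")
```

Under these assumptions both modules are tiny next to an 8B-parameter backbone, which is consistent with the abstract's description of the adapter as compact.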
Discussion: LLM embeddings can unify structured and narrative EHR information for multimodal diagnosis prediction, supporting efficient reuse of clinical representations across modalities and datasets through a small representation-level module.
Conclusion: Multimodal probing of frozen medical LLM representations provides a practical approach for studying EHR modalities and adapting clinical representations across datasets.

Comments:	9 pages, 2 figures. Supplementary materials provided as an ancillary file
Subjects:	Artificial Intelligence (cs.AI); Applications (stat.AP)
Cite as:	arXiv:2606.28798 [cs.AI]
	(or arXiv:2606.28798v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2606.28798 arXiv-issued DOI via DataCite

Submission history

From: Chengyuan Liu
[v1] Sat, 27 Jun 2026 08:10:05 UTC (379 KB)

Originally published at arxiv.org
