Primary ICD Category Prediction using LLM-based Probing
Quick Answer
This study demonstrates that frozen MedFound-Llama3-8B LLM embeddings can effectively unify structured and unstructured EHR data for primary diagnosis prediction, achieving 91.45% medical accuracy on MIMIC-IV.
Quick Take
The combined probe outperformed both single-modality probes and the XGBoost and PLM-ICD baselines, which points to potential gains in clinical coding efficiency.
Key Points
- Combined probing achieved 87.69% strict and 91.45% medical accuracy on the MIMIC-IV cohort.
- The structured-only probe beat its standard baseline by 6.19 points in medical accuracy.
- Diagnostic information became increasingly linearly separable in deeper transformer layers.
- A 2M-parameter adapter restored cross-dataset transfer to MIMIC-III using only 5% of target labels.
- The study supports efficient reuse of clinical representations across modalities and datasets.
Abstract:
Objective: ICD codes are central to reimbursement, research, and population health surveillance, yet automated coding systems often struggle to integrate diagnostic signals from both clinical narratives and structured electronic health record (EHR) variables. We evaluated whether frozen medical large language model (LLM) representations can serve as a shared embedding space for multimodal primary diagnosis category prediction.
Materials and Methods: We constructed a MIMIC-IV cohort of 13,645 admissions from the 10 most frequent primary ICD-10 codes, consolidated into seven categories. Structured variables were serialized into clinical narratives and combined with leakage-pruned discharge notes. Using a frozen MedFound-Llama3-8B-finetuned backbone, we extracted hidden states from five transformer layers and trained linear probes for structured-only, unstructured-only, and combined inputs, comparing against XGBoost and information-matched PLM-ICD baselines and evaluating MIMIC-III adaptation with a compact bottleneck adapter.
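The serialization step described above can be pictured with a minimal sketch. The field names (`age`, `sex`, `labs`, `medications`), the sentence templates, and the example values are all hypothetical; the abstract does not specify the authors' actual template or variable set.

```python
def serialize_admission(record: dict) -> str:
    """Turn a dict of structured EHR fields into a short clinical-style narrative.

    The field names and sentence templates are illustrative assumptions,
    not the template used in the paper.
    """
    parts = []
    age, sex = record.get("age"), record.get("sex")
    if age is not None and sex:
        parts.append(f"The patient is a {age}-year-old {sex}.")
    labs = record.get("labs", {})
    if labs:
        # Sort by lab name so the same record always yields the same text.
        lab_text = ", ".join(
            f"{name} {value} {unit}" for name, (value, unit) in sorted(labs.items())
        )
        parts.append(f"Laboratory results: {lab_text}.")
    meds = record.get("medications", [])
    if meds:
        parts.append("Medications: " + ", ".join(meds) + ".")
    return " ".join(parts)


# Made-up example record, for illustration only.
example = {
    "age": 67,
    "sex": "male",
    "labs": {"creatinine": (1.8, "mg/dL"), "troponin T": (0.04, "ng/mL")},
    "medications": ["aspirin", "furosemide"],
}
narrative = serialize_admission(example)
print(narrative)
```

Text produced this way could then be concatenated with a discharge note and passed to the frozen backbone, so both modalities end up in the same embedding space.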
Results: The combined probe performed best on MIMIC-IV (87.69% strict; 91.45% medical accuracy), exceeding both single-modality probes and baselines. The structured-only probe outperformed its standard baseline by 6.19 points in medical accuracy. Diagnostic information became increasingly linearly separable in deeper layers, and a 2M-parameter adapter restored cross-dataset transfer to MIMIC-III using only 5% of target labels.
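To illustrate what "linearly separable in deeper layers" means operationally, the sketch below trains the same linear probe (softmax regression) on two sets of fixed features and compares held-out accuracy. The features are synthetic stand-ins, not MedFound-Llama3-8B hidden states: the `signal` and `noise` values are assumptions chosen so that the "deep" layer carries a cleaner class signal than the "shallow" one.

```python
import math
import random


def make_layer_features(rng, n_per_class, n_classes, signal, noise):
    """Synthetic stand-in for frozen hidden states: class c is shifted along axis c."""
    X, y = [], []
    for c in range(n_classes):
        for _ in range(n_per_class):
            X.append([(signal if d == c else 0.0) + rng.gauss(0.0, noise)
                      for d in range(n_classes)])
            y.append(c)
    return X, y


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def logits_for(W, b, x):
    return [sum(w * v for w, v in zip(W[k], x)) + b[k] for k in range(len(W))]


def train_linear_probe(X, y, n_classes, lr=0.1, epochs=30, seed=0):
    """Fit a softmax-regression probe with plain SGD; the features stay fixed."""
    rng = random.Random(seed)
    dim = len(X[0])
    W = [[0.0] * dim for _ in range(n_classes)]
    b = [0.0] * n_classes
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            p = softmax(logits_for(W, b, X[i]))
            for k in range(n_classes):
                g = p[k] - (1.0 if k == y[i] else 0.0)  # cross-entropy gradient
                b[k] -= lr * g
                for d in range(dim):
                    W[k][d] -= lr * g * X[i][d]
    return W, b


def accuracy(W, b, X, y):
    hits = 0
    for xi, yi in zip(X, y):
        scores = logits_for(W, b, xi)
        hits += int(max(range(len(scores)), key=scores.__getitem__) == yi)
    return hits / len(X)


def probe_layer(signal, noise, seed=0):
    rng = random.Random(seed)
    X_tr, y_tr = make_layer_features(rng, 40, 3, signal, noise)
    X_te, y_te = make_layer_features(rng, 20, 3, signal, noise)
    W, b = train_linear_probe(X_tr, y_tr, 3)
    return accuracy(W, b, X_te, y_te)


acc_shallow = probe_layer(signal=0.3, noise=1.0)  # weak, noisy class signal
acc_deep = probe_layer(signal=3.0, noise=0.3)     # strong, clean class signal
print(f"shallow: {acc_shallow:.2f}, deep: {acc_deep:.2f}")
```

Because the probe has no hidden layers, any accuracy gap between the two feature sets reflects how linearly accessible the label is in each representation, which is the kind of comparison a layer-wise probing study makes.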
Discussion: LLM embeddings can unify structured and narrative EHR information for multimodal diagnosis prediction, supporting efficient reuse of clinical representations across modalities and datasets through a small representation-level module.
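As a rough sanity check on the size of such a module, the arithmetic below counts the parameters of a generic bottleneck adapter (down-projection, nonlinearity, up-projection). It assumes the backbone keeps Llama-3-8B's 4096-dimensional hidden state and picks a bottleneck width of 256 purely for illustration; the abstract does not state the adapter's actual dimensions or layout.

```python
def bottleneck_adapter_params(hidden_size: int, bottleneck: int, bias: bool = True) -> int:
    """Parameter count of a down-projection followed by an up-projection."""
    down = hidden_size * bottleneck + (bottleneck if bias else 0)
    up = bottleneck * hidden_size + (hidden_size if bias else 0)
    return down + up


HIDDEN_SIZE = 4096  # hidden width of Llama-3-8B (assumed unchanged in the backbone)
BOTTLENECK = 256    # hypothetical width, chosen only to land near 2M parameters

n_params = bottleneck_adapter_params(HIDDEN_SIZE, BOTTLENECK)
print(n_params)  # 2101504
```

Under these assumptions the adapter has about 2.1M trainable parameters, a tiny fraction of the 8B-parameter frozen backbone, which is consistent in scale with the "2M-parameter adapter" reported above.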
Conclusion: Multimodal probing of frozen medical LLM representations provides a practical approach for studying EHR modalities and adapting clinical representations across datasets.
| Comments: | 9 pages, 2 figures. Supplementary materials provided as an ancillary file |
| Subjects: | Artificial Intelligence (cs.AI); Applications (stat.AP) |
| Cite as: | arXiv:2606.28798 [cs.AI] (or arXiv:2606.28798v1 [cs.AI] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2606.28798 (arXiv-issued DOI via DataCite) |
Submission history
From: Chengyuan Liu
[v1] Sat, 27 Jun 2026 08:10:05 UTC (379 KB)
— Originally published at arxiv.org