Automatic Extraction of Structured Information from Brain MRI Reports Using an Open-Weight Large Language Model
Quick Answer
This paper shows that the open-weight LLaMA 3.1 model extracts structured information from Dutch brain MRI reports with high zero-shot accuracy: 90% (left) and 96% (right) for medial temporal atrophy scores, and 93% for microbleed mentions.
Quick Take
The LLaMA 3.1 model performs well at extracting structured information from Dutch brain MRI reports, reaching 90% (left) and 96% (right) accuracy for medial temporal atrophy and 93% for microbleed mentions without any examples in the prompt. Few-shot prompting raises accuracy on numerical variables (microbleed counts from 80% to 92%, infarct counts from 66% to 81%), indicating strong potential for large-scale neuroradiology research.
Key Points
- LLaMA 3.1 reached 90% (left) and 96% (right) zero-shot accuracy for medial temporal atrophy scores.
- Microbleed mentions were detected with 93% accuracy using the model.
- Few-shot prompting raised accuracy for microbleed counts from 80% to 92% and for infarct counts from 66% to 81%.
- Metrics were computed across 10 random splits of 947 Dutch brain MRI reports.
- Challenges remain for location-specific variables despite high overall accuracy.
Article Content
From source RSS / original summary: arXiv:2606.07721v1
Abstract:
Objectives: Automatic data extraction from free-text radiology reports enables large-scale research, but few studies assessed the performance of large language models (LLMs) on Dutch neuroradiology reports.
Methods: We analyzed 947 brain MRI reports from a tertiary memory clinic (2016-2021), authored by consultant neuroradiologists. Trained medical students annotated thirty variables; 100 reports were double-annotated to assess inter-rater reliability. We evaluated the performance of the open-weight LLM LLaMA 3.1 using different languages (Dutch vs. English translation) and few-shot prompting with different example selection strategies. Performance was evaluated using balanced accuracy for categorical variables, accuracy and mean absolute error for counts, and text similarity for free-text. Metrics were computed across 10 random splits of the 947 reports.
Results: LLaMA 3.1 demonstrated high zero-shot performance for visual rating scores (mean [95%-CI]): Medial Temporal Atrophy: 90% [77-100%] on the left and 96% [94-99%] on the right, Global Cortical Atrophy: 87% [83-91%], and Fazekas: 94% [93-96%]. Microbleed mentions were detected with 93% accuracy [92-95%] and infarct mentions with 82% [80-84%]. Text similarity for lesion location reached 0.95 [0.95-0.96]. Performance was lower for numerical variables: 80% [78-82%] for the number of microbleeds and 66% [63-68%] for infarcts. English translation yielded comparable results. Few-shot prompting improved performance for numerical variables, achieving 92% [90-93%] for microbleeds and 81% [77-85%] for infarcts using structural similarity-based selection.
Conclusion: LLaMA 3.1 shows strong potential for extracting data from Dutch neuroradiology reports. Few-shot prompting enhances performance for numerical variables, whereas challenges remain for location-specific variables.
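The abstract names balanced accuracy for categorical variables and accuracy plus mean absolute error for counts. As a rough illustration of what those metrics measure, here is a minimal Python sketch. It assumes the usual definition of balanced accuracy (the mean of per-class recall); the function names and the toy labels are hypothetical and are not taken from the paper or its data.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, over the classes that occur in y_true."""
    totals = defaultdict(int)  # number of reference labels per class
    hits = defaultdict(int)    # number of correct predictions per class
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def accuracy(y_true, y_pred):
    """Fraction of exact matches between reference and prediction."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between reference and predicted counts."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy annotations (not from the study).
score_true = [0, 0, 0, 1, 1, 2]   # e.g. a categorical visual rating score
score_pred = [0, 0, 1, 1, 1, 2]
count_true = [0, 2, 5, 1]         # e.g. number of microbleeds per report
count_pred = [0, 2, 3, 1]

print(balanced_accuracy(score_true, score_pred))
print(accuracy(count_true, count_pred))
print(mean_absolute_error(count_true, count_pred))
```

Balanced accuracy weights every class equally, which matters when most reports fall into one rating category; plain accuracy on counts only rewards exact matches, so mean absolute error is reported alongside it to show how far off the misses are.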