Comparing LLM and Fine-Tuned Model Performance on NVDRS Circumstance Extraction with Varying Prompt Complexity

arXiv cs.CL·Geoffrey Martin, Xuan Zhong Feng, Yifan Peng

5/22/2026

Quick Answer

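A study compares LLMs with fine-tuned RoBERTa on extracting 25 inferentially complex circumstances from NVDRS death investigation narratives. LLMs substantially outperform RoBERTa on low-prevalence circumstances, where training data is insufficient.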

Quick Take

The authors build a hybrid approach that selects a prompt strategy for each circumstance, either a name-only prompt or a detailed prompt with full coding guidelines, guided by a "Complexity Score" derived from the structure of the coding manual. The framework generalizes across frontier LLMs: GPT-5.2, Gemini 2.5 Pro, and Llama-3 70B show consistent performance patterns.

Key Points

LLMs substantially outperform fine-tuned RoBERTa on low-prevalence circumstances, where training data is insufficient.
A 'Complexity Score' algorithm analyzes the coding manual's structure to predict when detailed prompts with full coding guidelines improve over name-only prompts.
A hybrid approach selects the prompt strategy for each circumstance.
The study evaluated 25 inferentially complex circumstances from NVDRS.
The findings support using LLMs for rare, inferentially complex circumstances and fine-tuned models for common ones.
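
The last point, routing rare circumstances to an LLM and common ones to a fine-tuned model, can be sketched in a few lines. This is only an illustration, not the authors' implementation: the `Circumstance` type, the function names, the 0.05 prevalence cutoff, and the example prevalence values are all assumptions, since the summary gives no threshold or per-circumstance figures.

```python
from dataclasses import dataclass

# Hypothetical cutoff: the source does not state a prevalence threshold.
RARE_PREVALENCE_THRESHOLD = 0.05


@dataclass(frozen=True)
class Circumstance:
    name: str
    prevalence: float  # fraction of labeled narratives coded positive


def choose_extractor(circ: Circumstance,
                     threshold: float = RARE_PREVALENCE_THRESHOLD) -> str:
    """Return "llm" for rare circumstances and "fine_tuned" for common ones."""
    if not 0.0 <= circ.prevalence <= 1.0:
        raise ValueError("prevalence must be a fraction between 0 and 1")
    return "llm" if circ.prevalence < threshold else "fine_tuned"


def route(circumstances):
    """Map each circumstance name to the extractor that should handle it."""
    return {c.name: choose_extractor(c) for c in circumstances}


if __name__ == "__main__":
    # Placeholder names and prevalence values, not figures from the paper.
    examples = [
        Circumstance("rare_example", 0.01),
        Circumstance("common_example", 0.40),
    ]
    print(route(examples))
```

In practice the cutoff would be tuned on held-out data rather than fixed in advance.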

[Submitted on 21 May 2026]

Abstract: Suicide is a leading cause of death in the United States, and understanding the circumstances that precede it requires extracting structured information from death investigation narratives. Many of these circumstances require semantic inference beyond simple keyword matching. We develop a "Complexity Score" algorithm that analyzes coding manual structure to predict when detailed prompts with full coding guidelines improve over name-only prompts. We then construct a hybrid approach that selects a prompt strategy per circumstance. We evaluate large language models (LLMs) against fine-tuned RoBERTa on 25 inferentially complex circumstances from the National Violent Death Reporting System (NVDRS). We find that LLMs substantially outperform fine-tuned RoBERTa on low-prevalence circumstances where training data is insufficient. We further demonstrate that our framework generalizes across frontier LLMs, with GPT-5.2, Gemini 2.5 Pro, and Llama-3 70B showing consistent performance patterns. These findings support a hybrid architecture where LLMs handle rare, inferentially complex circumstances while fine-tuned models handle common ones.
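
To make the per-circumstance prompt selection concrete, here is a minimal sketch of choosing between a name-only prompt and a detailed prompt. The abstract does not describe how the Complexity Score is computed, so the rule-counting heuristic below is a stand-in, and the `ManualEntry` structure, the threshold of 3, and the prompt wording are likewise assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ManualEntry:
    """One circumstance's coding-manual entry (hypothetical structure)."""
    name: str
    definition: str
    inclusion_rules: List[str] = field(default_factory=list)
    exclusion_rules: List[str] = field(default_factory=list)


def complexity_score(entry: ManualEntry) -> int:
    """Placeholder heuristic: count the rules attached to the entry.

    The paper's actual scoring method is not given in the abstract.
    """
    return len(entry.inclusion_rules) + len(entry.exclusion_rules)


def build_prompt(entry: ManualEntry, narrative: str, threshold: int = 3) -> str:
    """Use a name-only prompt below the threshold, a detailed one otherwise."""
    header = (f"Does the narrative indicate the circumstance "
              f"'{entry.name}'? Answer yes or no.")
    if complexity_score(entry) < threshold:
        # Name-only prompt: just the circumstance name and the narrative.
        return f"{header}\n\nNarrative: {narrative}"
    # Detailed prompt: include the definition and every coding rule.
    lines = [header, "", f"Definition: {entry.definition}"]
    lines += [f"Code yes if: {rule}" for rule in entry.inclusion_rules]
    lines += [f"Code no if: {rule}" for rule in entry.exclusion_rules]
    lines += ["", f"Narrative: {narrative}"]
    return "\n".join(lines)
```

An entry with no rules scores 0 and receives the name-only prompt, while an entry with three or more rules receives the full guidelines.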

Comments:	Accepted at IEEE ICHI 2026
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2605.21845 [cs.CL]
	(or arXiv:2605.21845v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2605.21845 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Geoffrey Martin
[v1] Thu, 21 May 2026 00:33:52 UTC (20 KB)

Originally published at arxiv.org
