Can Large Language Models Revolutionize Survey Research? Experiments with Disaster Preparedness Responses

arXiv cs.AI·Yan Wang, Ziyi Guo, Christopher McCarty

Quick Take

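On a Hurricane Milton preparedness survey of Florida residents (n=946), a theory-informed LLM imputer (A-TLM) beat three classical imputation baselines on RMSE under block-wise MNAR missingness; the authors also warn that near-zero aggregate bias can mask opposing subgroup errors.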

Key Points

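Evaluated LLMs across five survey workflow stages: questionnaire design, sample selection, pilot testing, missing-data imputation, and post-collection analysis.
A-TLM outperformed all three classical imputation baselines on RMSE under block-wise MNAR conditions (1.439 vs. 1.496 for the next-best).
Proposed subgroup-stratified bias auditing as a reporting standard.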

[Submitted on 19 May 2026]

Abstract: Survey research faces mounting structural challenges: declining response rates, sample bias, block-wise missingness among at-risk respondents, and AI-assisted fraudulent completions in online panels. Large language models (LLMs) have been proposed as a remedy, yet rigorous evaluations across the full survey workflow remain scarce, particularly in disaster contexts where data quality matters most. We present and evaluate a five-stage framework for LLM integration covering questionnaire design, sample selection, pilot testing, missing-data imputation, and post-collection analysis, using the 2024 Hurricane Milton preparedness survey of Florida residents (n=946) as a shared empirical testbed. We introduce a Protection Motivation Theory (PMT)-constrained co-occurrence knowledge graph and develop seven LLM configurations spanning zero-shot inference, retrieval-augmented baselines, and novel theory-informed variants. Our proposed Anchored Marginal Theory-Informed LLM (A-TLM) outperforms all three classical imputation baselines (IPW/MI, MICE+PMM, missForest) on RMSE under disaster-relevant block-wise MNAR conditions (S4 RMSE 1.439 vs. 1.496 for the next-best), while achieving near-zero signed bias (-0.121) where the random-forest imputer produces the largest absolute bias (-0.631). Organizing retrieval around PMT causal structure and integrating all evidence in a single model call outperforms unstructured retrieval and staged sequential inference (MAE 0.993 vs. 1.097 for standard RAG). We document that near-zero aggregate bias can mask opposing subgroup errors and propose subgroup-stratified bias auditing as a reporting standard. A retrieval-constrained knowledge-graph chatbot demonstrates that hallucination is architecturally manageable through grounded refusal.
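The abstract's point that near-zero aggregate bias can mask opposing subgroup errors can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the toy values, the subgroup labels "A" and "B", the function names, and the sign convention (bias = imputed minus true) are hypothetical and are not taken from the paper or its data.

```python
from collections import defaultdict
from math import sqrt


def rmse(true, pred):
    """Root mean squared error between true and imputed values."""
    return sqrt(sum((p - t) ** 2 for t, p in zip(true, pred)) / len(true))


def signed_bias(true, pred):
    """Mean signed error; assumed convention: imputed minus true."""
    return sum(p - t for t, p in zip(true, pred)) / len(true)


def bias_by_subgroup(true, pred, groups):
    """Signed bias computed separately within each subgroup label."""
    errors = defaultdict(list)
    for t, p, g in zip(true, pred, groups):
        errors[g].append(p - t)
    return {g: sum(errs) / len(errs) for g, errs in errors.items()}


# Hypothetical toy data: subgroup A is over-imputed, subgroup B under-imputed.
true = [3, 3, 3, 3]
pred = [4, 4, 2, 2]
groups = ["A", "A", "B", "B"]

print(signed_bias(true, pred))               # aggregate bias
print(rmse(true, pred))                      # overall error magnitude
print(bias_by_subgroup(true, pred, groups))  # per-subgroup bias
```

In this toy case the aggregate signed bias is zero while each subgroup is off by one point in opposite directions, which is the kind of cancellation a subgroup-stratified audit is meant to expose. RMSE does not cancel in the same way, so reporting it alongside stratified bias gives a fuller picture.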

Subjects:	Artificial Intelligence (cs.AI)
MSC classes:	62D05, 68T50, 62F10, 62-07, 91C20
ACM classes:	H.3.5; I.2.7; H.2.8; I.2.6; J.4
Cite as:	arXiv:2605.19229 [cs.AI]
	(or arXiv:2605.19229v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2605.19229 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Yan Wang
[v1] Tue, 19 May 2026 00:58:36 UTC (540 KB)

— Originally published at arxiv.org
