(Human) Attention Is (Still) All You Need: Human oversight makes AI-assisted social science reliable
Quick Answer
This paper shows that Human-in-the-Loop Economic Research (HLER), a decision architecture built on structured human oversight, cuts the critical-failure rate of AI-assisted social science from 72% (unconstrained multi-agent baseline) to 16%.
Quick Take
HLER restructures how cognitive labour is divided between humans and machines through three architectural commitments: LLMs reason but do not execute data work, data handling and estimation run deterministically, and three human decision gates bind the workflow. The baseline and HLER use the same underlying model and the same agent decomposition, so the gain comes from the architecture rather than from model capability.
Key Points
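- HLER reduced critical failures from 72% of runs (unconstrained multi-agent baseline) to 16%, using the same underlying model.
- The pre-specified 2×4 factorial experiment covered 280 complete research runs across four datasets.
- An 80-run ablation suggests that deterministic computation and human decision gates contribute independently, with exploratory evidence of complementarity.
- Reliability gains were largest on the least publicly represented dataset, a Qing-dynasty population register.
- The authors frame HLER as a research harness rather than an autonomous AI scientist: it keeps unreliable claims from being advanced as publication-ready outputs.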
Article Content
arXiv:2606.12848v1 (announce type: new)
Abstract: Large language models (LLMs) are increasingly used for tasks once reserved for trained researchers, including hypothesis generation, specification choice, and drafting conclusions. We argue that the reliability of AI-assisted research depends not only on model capability, but also on how cognitive labour is structured between humans and machines.
We study this problem through Human-in-the-Loop Economic Research (HLER), a decision architecture based on pre-commitment, decision sequencing, accountability, and attention allocation. In a pre-specified 2×4 factorial experiment with 280 complete research runs across four datasets, an unconstrained multi-agent baseline produced critical failures in 72% of runs.
Using the same underlying model, the same agent decomposition, and identical prompts for the shared reasoning agents, HLER reduced the failure rate to 16% by imposing three architectural commitments: LLMs reason but do not execute data work, data and estimation are handled deterministically, and three human decision gates bind the workflow. Fisher's exact test rejects equality of failure rates at p < 0.001.
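To make the reported test concrete, here is a minimal sketch of a two-sided Fisher's exact test on a 2×2 table of failures by architecture. The cell counts are illustrative assumptions, not figures from the paper: they assume the 280 runs split evenly into 140 per arm and round 72% and 16% to whole runs. The function name `fisher_exact_two_sided` is likewise hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x failures in row 1, margins held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of every table no more likely than the observed one.
    # The small tolerance guards against floating-point ties.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))


# ASSUMED counts (not from the paper): 140 runs per arm, percentages rounded.
baseline_fail, baseline_ok = 101, 39   # about 72% of 140
hler_fail, hler_ok = 22, 118           # about 16% of 140

p_value = fisher_exact_two_sided(baseline_fail, baseline_ok, hler_fail, hler_ok)
print(f"two-sided p-value: {p_value:.3g}")
```

Under these assumed counts the p-value falls far below 0.001, in line with the abstract's statement; the exact value depends on the true per-arm counts, which the abstract does not give.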
Reliability gains were largest on the least publicly represented dataset, a Qing-dynasty population register, consistent with a task-based production model with Fréchet-distributed output quality. An 80-run ablation suggests that deterministic computation and human gates contribute independently, with exploratory evidence of complementarity.
We interpret HLER as a research harness rather than an autonomous AI scientist: it sharply reduces failures, makes residual weaknesses more visible, and prevents unreliable claims from being advanced as publication-ready outputs.
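The three architectural commitments can be sketched as a small harness. This is a hedged illustration, not the authors' implementation: the abstract does not say where the three human gates sit, so placing them at the question, results, and claims stages is an assumption, and every name here (`Proposal`, `human_gate`, `run_harness`, the difference-in-means estimator) is hypothetical. The LLM is represented only by the `propose` and `draft` callables, which never touch the data.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Proposal:
    hypothesis: str
    outcome: str     # name of the outcome column
    treatment: str   # name of the 0/1 treatment column


class GateRejected(Exception):
    """Raised when a human reviewer declines to pass a gate."""


def human_gate(name, payload, approve):
    # A binding gate: the workflow cannot continue without approval.
    if not approve(name, payload):
        raise GateRejected(name)
    return payload


def estimate_difference_in_means(rows, outcome, treatment):
    # Deterministic data work: plain code, no LLM involved.
    treated = [r[outcome] for r in rows if r[treatment] == 1]
    control = [r[outcome] for r in rows if r[treatment] == 0]
    return mean(treated) - mean(control)


def run_harness(rows, propose, draft, approve):
    # Gate 1 (assumed placement): a human signs off on the research question.
    proposal = human_gate("question", propose(), approve)
    estimate = estimate_difference_in_means(rows, proposal.outcome, proposal.treatment)
    # Gate 2 (assumed placement): a human reviews the deterministic estimate.
    human_gate("results", estimate, approve)
    # Gate 3 (assumed placement): a human approves the drafted claims.
    return human_gate("claims", draft(proposal, estimate), approve)


# Toy usage with stand-ins for the LLM and the reviewer.
rows = [{"y": 3.0, "d": 1}, {"y": 5.0, "d": 1}, {"y": 1.0, "d": 0}, {"y": 2.0, "d": 0}]
report = run_harness(
    rows,
    propose=lambda: Proposal("d raises y", outcome="y", treatment="d"),
    draft=lambda p, est: f"{p.hypothesis}: estimated effect {est:.2f}",
    approve=lambda gate, payload: True,
)
print(report)
```

If the reviewer rejects at any gate, `GateRejected` stops the run, so no draft reaches the output without human sign-off.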