BehaviorBench: Benchmarking Foundation Models for Behavioral Science Tasks

arXiv cs.CL·Jin Huang, Yutong Xie, Wanli Song, Xingjian Zhang, Walter Yuan, Matthew O. Jackson, Qiaozhu Mei

6/24/2026 · ~2 min read · en

Quick Answer

This paper introduces BehaviorBench, a benchmark that evaluates foundation models on behavioral science tasks, and finds that Be.FM-1.5 leads on distributional alignment while remaining competitive on individual-level metrics.

Quick Take

BehaviorBench evaluates foundation models on behavioral science tasks at both the individual and the population level. Be.FM-1.5, a model fine-tuned on behavioral data, leads on distributional alignment while remaining competitive on individual-level metrics. Proprietary general-purpose models do well on individual-level prediction and knowledge-intensive tasks but align less closely with population-level response distributions, which points to the value of behavioral adaptation.

Key Points

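BehaviorBench evaluates models on four capabilities: behavior prediction and simulation, strategic decision-making, subject-trait inference, and behavioral knowledge application.
Outputs are scored at both the individual level (per-subject accuracy) and the distributional level (population-level alignment).
Be.FM-1.5 shows stronger distributional alignment than general-purpose models.
Proprietary general-purpose models excel at individual-level prediction and knowledge-intensive tasks but trail behavioral foundation models on distributional alignment.
The authors position BehaviorBench as a foundation for developing and assessing behaviorally aligned AI systems.
BehaviorBench and the Be.FM-1.5 models can be accessed at https://umich-foreseer.github.io/behaviorbench/.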

Article Content

From source RSS / original summary

arXiv:2606.24162v1 Announce Type: new Abstract: Foundation models have been increasingly applied to behavioral science domains such as psychology, sociology, and economics. While these models show promise in individual tasks such as survey response prediction and human-subject experiment simulation, there remains no systematic understanding of how well they perform across diverse behavioral science tasks, contexts, and populations.

We introduce BehaviorBench, a comprehensive benchmark that evaluates foundation models along four core capabilities: (1) behavior prediction and simulation, (2) strategic decision-making, (3) subject-trait inference, and (4) behavioral knowledge application. Crucially, BehaviorBench evaluates model outputs at both the individual and distributional levels, capturing not only per-subject accuracy but also population-level alignment, an essential requirement for behavioral validity.

Leveraging the tasks in BehaviorBench, we further develop Be.FM-1.5, extending the Be.FM family of behavioral foundation models fine-tuned on behavioral data. Our results reveal a considerable gap: proprietary general-purpose models excel at individual-level prediction and knowledge-intensive tasks, whereas behavioral foundation models, fine-tuned on behavioral data, achieve substantially stronger distributional alignment. Notably, Be.FM-1.5 leads on distributional metrics and remains competitive on individual-level metrics, suggesting that proper behavioral adaptation can close the gap. Our results highlight the importance of distributional evaluation, establish BehaviorBench as a foundation for developing and assessing behaviorally aligned AI systems, and demonstrate Be.FM-1.5's potential for a broad range of behavioral science studies. Our BehaviorBench and Be.FM-1.5 models can be accessed via https://umich-foreseer.github.io/behaviorbench/.
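The abstract's distinction between individual-level accuracy and distributional alignment can be illustrated with a small sketch. The abstract does not name the metrics BehaviorBench uses, so the choices here are assumptions made only for illustration: exact-match accuracy stands in for an individual-level metric, total variation distance stands in for a distributional one, and the function names and toy data are hypothetical.

```python
from collections import Counter


def individual_accuracy(predicted, observed):
    """Fraction of subjects whose predicted response equals the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


def total_variation_distance(predicted, observed):
    """Half the L1 distance between the two empirical response distributions.

    0.0 means the population-level response shares are identical;
    1.0 means they do not overlap at all.
    """
    p_counts, o_counts = Counter(predicted), Counter(observed)
    options = set(p_counts) | set(o_counts)
    return 0.5 * sum(
        abs(p_counts[opt] / len(predicted) - o_counts[opt] / len(observed))
        for opt in options
    )


# Hypothetical toy data: 10 subjects answer a yes/no item, split 60/40.
observed = ["yes"] * 6 + ["no"] * 4

# Hypothetical model A: predicts the majority answer for every subject.
model_a = ["yes"] * 10

# Hypothetical model B: reproduces the 60/40 split, but mostly for the wrong subjects.
model_b = ["no"] * 4 + ["yes"] * 6

for name, preds in [("A", model_a), ("B", model_b)]:
    acc = individual_accuracy(preds, observed)
    tvd = total_variation_distance(preds, observed)
    print(f"model {name}: accuracy={acc:.2f}, total variation distance={tvd:.2f}")
```

In this toy case model A has the higher per-subject accuracy (0.60 versus 0.20), yet it predicts a population in which everyone answers "yes", while model B matches the observed 60/40 split exactly (distance 0.00 versus 0.40). This is the kind of divergence between the two levels that motivates reporting both.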
