BehaviorBench: Benchmarking Foundation Models for Behavioral Science Tasks
Quick Answer
The paper introduces BehaviorBench, a benchmark for foundation models on behavioral science tasks, and reports that Be.FM-1.5 leads on distributional alignment while remaining competitive on individual-level metrics.
Quick Take
BehaviorBench scores foundation models at both the individual and the distributional level. Proprietary general-purpose models excel at individual-level prediction and knowledge-intensive tasks but show weaker population-level alignment than models fine-tuned on behavioral data. Be.FM-1.5 leads on distributional metrics and stays competitive on individual-level ones, which the authors take as evidence that behavioral adaptation can close the gap.
Key Points
- BehaviorBench evaluates models on behavior prediction, decision-making, trait inference, and knowledge application.
- Be.FM-1.5 shows superior distributional alignment compared to general-purpose models.
- Proprietary general-purpose models excel at individual-level prediction and knowledge-intensive tasks but show weaker distributional alignment than behavioral foundation models.
- BehaviorBench serves as a foundation for developing behaviorally aligned AI systems.
- BehaviorBench and the Be.FM-1.5 models can be accessed at https://umich-foreseer.github.io/behaviorbench/.
Article Content
From source RSS / original summary. arXiv:2606.24162v1. Abstract: Foundation models have been increasingly applied to behavioral science domains such as psychology, sociology, and economics. While these models show promise in individual tasks such as survey response prediction and human-subject experiment simulation, there remains no systematic understanding of how well they perform across diverse behavioral science tasks, contexts, and populations.
We introduce BehaviorBench, a comprehensive benchmark that evaluates foundation models along four core capabilities: (1) behavior prediction and simulation, (2) strategic decision-making, (3) subject-trait inference, and (4) behavioral knowledge application. Crucially, BehaviorBench evaluates model outputs at both the individual and distributional levels, capturing not only per-subject accuracy but also population-level alignment, an essential requirement for behavioral validity.
Leveraging the tasks in BehaviorBench, we further develop Be.FM-1.5, extending the Be.FM family of behavioral foundation models fine-tuned on behavioral data. Our results reveal a considerable gap: proprietary general-purpose models excel at individual-level prediction and knowledge-intensive tasks, whereas behavioral foundation models, fine-tuned on behavioral data, achieve substantially stronger distributional alignment. Notably, Be.FM-1.5 leads on distributional metrics and remains competitive on individual-level metrics, suggesting that proper behavioral adaptation can close the gap. Our results highlight the importance of distributional evaluation, establish BehaviorBench as a foundation for developing and assessing behaviorally aligned AI systems, and demonstrate Be.FM-1.5's potential for a broad range of behavioral science studies. Our BehaviorBench and Be.FM-1.5 models can be accessed via https://umich-foreseer.github.io/behaviorbench/.
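The difference between individual-level and distributional evaluation can be made concrete with a small sketch. The abstract does not say which metrics BehaviorBench uses, so the two below (per-subject accuracy and total variation distance between response distributions) are assumptions chosen for illustration, as are the function names and the toy data. They are not the benchmark's actual scoring code.

```python
from collections import Counter


def individual_accuracy(predicted, observed):
    """Share of subjects whose predicted response equals the observed one."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


def total_variation_distance(predicted, observed):
    """Half the L1 gap between the two empirical response distributions.

    0.0 means the predicted population matches the observed one exactly;
    1.0 means the two populations share no response category.
    """
    if not predicted or not observed:
        raise ValueError("inputs must be non-empty")
    p_counts, o_counts = Counter(predicted), Counter(observed)
    categories = set(p_counts) | set(o_counts)
    return 0.5 * sum(
        abs(p_counts[c] / len(predicted) - o_counts[c] / len(observed))
        for c in categories
    )


# Toy data, made up for illustration: ten subjects answering "A" or "B".
observed = list("AAAAAABBBB")
majority_model = list("AAAAAAAAAA")  # always predicts the modal answer
mixed_model = list("AAAABBAABB")     # right proportions, some subjects swapped

for name, preds in [("majority", majority_model), ("mixed", mixed_model)]:
    acc = individual_accuracy(preds, observed)
    tvd = total_variation_distance(preds, observed)
    print(f"{name}: accuracy={acc:.2f}, tvd={tvd:.2f}")
```

In this toy case both hypothetical models get 6 of 10 subjects right, yet only the second reproduces the 60/40 split of the observed population. Per-subject accuracy alone would not separate them, which is the kind of gap the abstract's distributional evaluation is meant to expose.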