Modeling Community Attitude through Reaction Tone: A Human-AI Collaborative Framework for Evaluating LLM Alignment with Linguistic Behaviors in Online Communities
Quick Take
The CARE framework benchmarks LLM-simulated discourse against real community reactions to news events and reveals a persistent 'realism gap' in simulation fidelity. Divergent behaviors among frontier models suggest that current alignment strategies remain insufficient for capturing the sociolinguistic dynamics of online groups.
Key Points
- CARE evaluates LLMs against authentic community responses to real-world events.
- The framework identifies a persistent 'realism gap' in LLM simulations.
- Divergent behavioral signatures among frontier models suggest that current alignment strategies do not yet capture community dynamics.
- Explicit community prompts do not inherently improve LLM simulation fidelity.
- The framework's spectrum of illocutionary tones and underlying attitudes is validated through human-AI collaboration.
Article Excerpt
From source RSS / original summary
arXiv:2605.27388v1 Announce Type: new
Abstract: Large language models (LLMs) are increasingly utilized as proxies for computational social analysis; yet, their ability to faithfully represent the "thick descriptions" (Geertz, 1973) of human communities remains a critical challenge. Current evaluations often reduce social identity to static labels, sidelining how real-world groups navigate social shifts.
To bridge this gap, we introduce CARE (Community-Aware Reaction Evaluation), a reaction-centered framework that benchmarks LLM-simulated discourse against the authentic, event-contingent responses of distinct communities to real-world news.
By characterizing a fine-grained spectrum of illocutionary tones and the underlying attitudes they manifest--validated through human-AI collaboration--our diagnosis reveals a persistent "realism gap": steering LLMs with explicit community prompts fails to inherently improve simulation fidelity. Analysis further identifies divergent behavioral signatures among frontier models, suggesting that current alignment strategies remain insufficient for capturing the sociolinguistic dynamics of online groups.
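The abstract does not say how the "realism gap" is scored, so the following is only an illustrative sketch of one way to compare tone profiles, not CARE's actual method. It assumes each reaction has already been assigned a single tone label, and it swaps in Jensen-Shannon divergence as the distance between the real and simulated tone distributions. The tone names in `TONES`, the function names, and the idea of reducing the gap to a single number are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical tone labels; the source does not list CARE's actual taxonomy.
TONES = ["supportive", "critical", "sarcastic", "neutral"]


def tone_distribution(labels, tones=TONES):
    """Convert a list of tone labels into a probability distribution over `tones`."""
    counts = Counter(labels)
    total = sum(counts[t] for t in tones)
    if total == 0:
        raise ValueError("no labels from the tone set were found")
    return [counts[t] / total for t in tones]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical distributions, 1 for disjoint ones."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0 because y is the mixture.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def realism_gap(real_labels, simulated_labels):
    """Assumed gap score: divergence between real and LLM-simulated tone profiles."""
    return js_divergence(
        tone_distribution(real_labels),
        tone_distribution(simulated_labels),
    )
```

Under these assumptions, one could compute `realism_gap` once for reactions generated with a generic prompt and once for reactions generated with an explicit community prompt; the abstract's finding would correspond to the second score not being reliably lower than the first.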