BenSyc: Benchmarking Conversational Sycophancy and Human Alignment in LLMs for Bengali Contexts
Quick Answer
BenSyc is the first benchmark for assessing conversational sycophancy in Bengali contexts. It shows that leading LLMs struggle to distinguish empathetic support from validation: the best system reaches only 61.8 Macro-F1 on binary detection.
Quick Take
An evaluation of more than 15 open and proprietary LLMs finds substantial variation across model families and conversational behaviors, underscoring the need for culturally grounded benchmarks for conversational AI.
Key Points
- The BenSyc benchmark is built from 11,840 Reddit posts and 170k comments collected from communities across Bangladesh and West Bengal.
- The best system achieved only 61.8 Macro-F1 on binary detection and 61.7 Macro-F1 on five-class classification.
- Significant challenges remain in distinguishing empathetic support from validation.
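- In generation settings, several models frequently produce strongly validating or escalatory responses in emotionally charged situations.
- The results underscore the need for culturally grounded multilingual benchmarks for socially aligned conversational AI.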
Article Content
From source RSS / original summary
arXiv:2606.10061v1 Announce Type: new
Abstract: Large language models (LLMs) increasingly participate in emotionally sensitive social conversations, where responses may shift from balanced support toward excessive validation or escalatory alignment. Existing sycophancy research primarily focuses on factual agreement and instruction-following settings, leaving culturally grounded conversational sycophancy underexplored.
We introduce BenSyc, the first benchmark for studying conversational sycophancy in Bengali social contexts. Starting from 11,840 Reddit posts and 170k comments collected from communities across Bangladesh and West Bengal, we construct a human-validated benchmark with binary labels and a fine-grained five-level taxonomy spanning Invalidation, Neutral, Support, Validation, and Escalation. We evaluate more than 15 open and proprietary LLMs on conversational alignment classification and response generation tasks.
Results show that distinguishing empathetic support from reinforcement-oriented validation remains challenging even for frontier instruction-tuned models: the best system achieves only 61.8 Macro-F1 on binary detection and 61.7 Macro-F1 on five-class classification. In generation settings, several models frequently produce strongly validating or escalatory responses in emotionally charged situations.
Our findings highlight substantial variation across model families and conversational behaviors, underscoring the importance of culturally grounded multilingual benchmarks for evaluating socially aligned conversational AI systems.
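For readers unfamiliar with the metric, the Macro-F1 scores reported above are the unweighted mean of per-class F1, so a rare class such as Escalation counts as much as a common one. The following is a minimal sketch of that computation over the five-level taxonomy, not the authors' evaluation code. The function name `macro_f1`, the toy `gold` and `pred` lists, and the choice to score a class with no examples as 0 are assumptions made for illustration.

```python
from collections import Counter

# The five taxonomy levels named in the abstract.
LABELS = ["Invalidation", "Neutral", "Support", "Validation", "Escalation"]


def macro_f1(gold, pred, labels=LABELS):
    """Return the unweighted mean of per-class F1 scores.

    Assumption: a class with no gold and no predicted examples scores 0.0.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted p, but the gold label was different
            fn[g] += 1  # gold label g was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels, not data from the benchmark.
gold = ["Support", "Validation", "Support", "Escalation", "Neutral", "Invalidation"]
pred = ["Support", "Support", "Validation", "Escalation", "Neutral", "Invalidation"]
score = macro_f1(gold, pred)
print(round(score, 3))  # 0.7
```

In this toy case, confusing Support with Validation in both directions drops Support to 0.5 F1 and Validation to 0.0, which pulls the macro average down to 0.7 even though four of six predictions are correct.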