Polar: A Benchmark for Evaluating Political Bias in LLMs
Quick Take
Polar is a new 4,026-instance multiple-choice benchmark for evaluating political bias in LLMs across U.S. and South Korean contexts. All 38 tested LLMs lean left-progressive on U.S. content but show more centered and mixed patterns on South Korean content, which underscores the need for multilingual and cross-contextual bias evaluation.
Key Points
- Polar measures political bias through option-level likelihoods rather than prompt-based generation.
- It covers two ideological axes and eight issue categories from the Manifesto Project.
- All 38 LLMs tested lean left-progressive on U.S. political content.
- Measured bias varies systematically with political context, issue category, model group, and presentation language.
- Translation experiments indicate that presentation language alone can shift measured bias.
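To make "option-level likelihoods" concrete, here is a minimal sketch of how a multiple-choice item could be scored without generating any text. It is an illustration only, not the paper's scoring rule: the function names, the leaning values (-1 for a left-progressive option, +1 for a right-conservative one), and the example log-likelihoods are all assumptions. In practice the log-likelihoods would come from a model's scores for each answer option.

```python
import math


def option_probabilities(option_logprobs):
    """Softmax-normalize per-option log-likelihoods into a distribution."""
    peak = max(option_logprobs.values())  # subtract the max for numerical stability
    exps = {opt: math.exp(lp - peak) for opt, lp in option_logprobs.items()}
    total = sum(exps.values())
    return {opt: value / total for opt, value in exps.items()}


def lean_score(option_logprobs, option_leaning):
    """Probability-weighted leaning of one item.

    The sign convention is hypothetical: -1.0 marks a left-progressive
    option and +1.0 a right-conservative one, so the score lies in [-1, 1].
    """
    probs = option_probabilities(option_logprobs)
    return sum(probs[opt] * option_leaning[opt] for opt in probs)


# Hypothetical item: the model assigns 75% of the normalized mass to option A.
example_logprobs = {"A": math.log(0.75), "B": math.log(0.25)}
example_leaning = {"A": -1.0, "B": 1.0}

if __name__ == "__main__":
    print(lean_score(example_logprobs, example_leaning))
```

Because the score is read directly from the likelihood the model assigns to each fixed option, it does not depend on sampling or on parsing free-form output, which is one reason a likelihood-based measurement can be easier to reproduce than a generation-based one.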
Article Excerpt
From source RSS / original summary
arXiv:2606.12922v1 Announce Type: new
Abstract: Political bias in large language models (LLMs) is increasingly significant, but difficult to measure reproducibly across political and linguistic contexts. We introduce Polar, a 4,026-instance multiple-choice benchmark that measures political bias through option-level likelihoods rather than prompt-based generation. Polar covers two ideological axes and eight issue categories derived from the Manifesto Project, and evaluates models in parallel across U.S. and South Korean political contexts. Across 38 LLMs, measured bias varies systematically with political context, issue category, model group, and presentation language. All models lean left-progressive on U.S. political content, but show more centered and mixed patterns on South Korean content. Translation experiments further show that presentation language alone can shift measured bias. These findings highlight the need for multilingual and cross-contextual evaluation of political bias in LLMs.