Polar: A Benchmark for Evaluating Political Bias in LLMs

arXiv cs.CL·Sangho Kim, Heejin Kim, Yoonhee Park, Hyunggeun Jeon, Jaejin Lee

1d ago

6/12/2026 · ~1 min read · en

Quick Answer

Polar is a new benchmark of 4,026 instances for evaluating political bias in LLMs across U.S. and South Korean political contexts.

Quick Take

Polar is a 4,026-instance multiple-choice benchmark for evaluating political bias in LLMs across U.S. and South Korean political contexts. Across 38 tested LLMs, measured bias varies systematically: every model leans left-progressive on U.S. content, while patterns on South Korean content are more centered and mixed. The authors conclude that political bias in LLMs needs multilingual and cross-contextual evaluation.

Key Points

Polar measures political bias through option-level likelihoods rather than prompt-based generation.
It covers two ideological axes and eight issue categories from the Manifesto Project.
All 38 LLMs tested lean left-progressive on U.S. political content.
Measured bias varies systematically with political context, issue category, model group, and presentation language.
Translation experiments show that presentation language alone can shift measured bias.
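
To make "option-level likelihoods" concrete, here is a minimal Python sketch of one way such scoring could work. It is not the paper's method: the summary does not give Polar's scoring formula, so the softmax normalization, the -1 (left) to +1 (right) position scale, the function names, and the example numbers are all assumptions for illustration.

```python
import math


def option_probabilities(option_logprobs):
    """Turn per-option log-likelihoods into normalized probabilities (softmax)."""
    peak = max(option_logprobs.values())  # subtract the max for numerical stability
    shifted = {key: math.exp(value - peak) for key, value in option_logprobs.items()}
    total = sum(shifted.values())
    return {key: value / total for key, value in shifted.items()}


def lean_score(option_logprobs, option_positions):
    """Expected position on a hypothetical -1 (left) .. +1 (right) scale."""
    probs = option_probabilities(option_logprobs)
    return sum(probs[key] * option_positions[key] for key in probs)


# Placeholder inputs, not taken from the paper: the log-likelihood a model
# assigns to each answer option, and an assumed position for each option.
logprobs = {"A": -1.2, "B": -2.3, "C": -0.7}
positions = {"A": -1.0, "B": 0.0, "C": 1.0}

probs = option_probabilities(logprobs)
score = lean_score(logprobs, positions)
print(max(probs, key=probs.get), round(score, 3))
```

Because the score comes from the likelihoods of fixed options rather than from sampled text, the same model and inputs give the same number on every run. That is the reproducibility argument for this style of measurement.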


Article Excerpt

From source RSS / original summary

arXiv:2606.12922v1. Announce Type: new.

Abstract: Political bias in large language models (LLMs) is increasingly significant, but difficult to measure reproducibly across political and linguistic contexts. We introduce Polar, a 4,026-instance multiple-choice benchmark that measures political bias through option-level likelihoods rather than prompt-based generation. Polar covers two ideological axes and eight issue categories derived from the Manifesto Project, and evaluates models in parallel across U.S. and South Korean political contexts. Across 38 LLMs, measured bias varies systematically with political context, issue category, model group, and presentation language. All models lean left-progressive on U.S. political content, but show more centered and mixed patterns on South Korean content. Translation experiments further show that presentation language alone can shift measured bias. These findings highlight the need for multilingual and cross-contextual evaluation of political bias in LLMs.
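
A cross-contextual comparison like the one the abstract describes could be organized along the following lines. This is a hypothetical sketch only: the record fields (`context`, `language`, `lean`), the helper name, and every number are made up for illustration. None of them are results or data structures from the paper.

```python
from collections import defaultdict
from statistics import mean


def mean_lean_by_group(records, keys):
    """Average per-instance lean scores within each combination of `keys`."""
    groups = defaultdict(list)
    for record in records:
        groups[tuple(record[key] for key in keys)].append(record["lean"])
    return {group: mean(values) for group, values in groups.items()}


# Made-up per-instance scores on a hypothetical -1 (left) .. +1 (right) scale.
records = [
    {"context": "US", "language": "en", "lean": -0.4},
    {"context": "US", "language": "en", "lean": -0.2},
    {"context": "US", "language": "ko", "lean": -0.1},
    {"context": "KR", "language": "ko", "lean": 0.1},
    {"context": "KR", "language": "en", "lean": -0.1},
]

by_cell = mean_lean_by_group(records, ("context", "language"))

# Same political context, different presentation language: the gap between the
# two cells is the kind of shift a translation experiment would look at.
us_language_shift = by_cell[("US", "ko")] - by_cell[("US", "en")]
print(by_cell, us_language_shift)
```

Grouping by political context and presentation language separately keeps the two effects apart, so a shift that appears only when the language changes is not confused with a difference between the U.S. and South Korean item sets.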

