Improving Labeling Consistency with Detailed Constitutional Definitions and AI-Driven Evaluation

arXiv cs.CL·Konstantin Berlin, Adam Swanda


~2 min · 5/26/2026 · en

Quick Take

An AI-driven workflow improves labeling consistency for content moderation: AI helps write a detailed per-category constitution, and a frontier LLM applies it to each input.

Key Points

Prescriptive definitions reduce labeler disagreement.
A frontier LLM interprets the constitution on each input to produce consistent golden labels.
Cross-model inconsistency drops by up to 57x compared to paragraph definitions.

Article Content

From source RSS / original summary

arXiv:2605.24247v1. Abstract: Many automated labeling pipelines classify inputs into categories defined by a written specification, content moderation being a prominent use case. Simple category definitions are not detailed enough for labelers to produce the accurate, consistent golden labels these pipelines require. One solution is to write a prescriptive definition that settles enough real boundary cases that labelers cannot disagree with the written interpretation.

In practice, definitions at that level of detail exceed what a human annotator can hold in working memory, so annotators fall back on intuition and the labels drift from the written rules, regressing on accuracy and consistency.

We propose and demonstrate the efficacy of an AI-driven workflow in which AI helps write a per-category constitution that defines the label in enough detail to cover edge cases, and a frontier LLM interprets it on each input to produce the golden label more consistently and accurately than humans reading the same document.
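The labeling step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the `VIOLATES`/`ALLOWED` reply format, and the names `build_prompt` and `golden_label` are all assumptions, and `llm` is a placeholder for whatever model client is used.

```python
from typing import Callable, List


def build_prompt(category: str, constitution: str, conversation: List[str]) -> str:
    """Assemble one prompt carrying the full constitution and the full conversation.

    The prompt layout is hypothetical; the abstract does not specify one.
    """
    turns = "\n".join(conversation)
    return (
        f"Category: {category}\n"
        f"Constitution:\n{constitution}\n\n"
        f"Conversation:\n{turns}\n\n"
        "Answer with exactly one word: VIOLATES or ALLOWED."
    )


def golden_label(
    category: str,
    constitution: str,
    conversation: List[str],
    llm: Callable[[str], str],
) -> bool:
    """Return True if the model judges the conversation to fall in the category.

    `llm` is any callable mapping a prompt string to a reply string
    (an assumption standing in for a real model client).
    """
    reply = llm(build_prompt(category, constitution, conversation)).strip().upper()
    if reply not in {"VIOLATES", "ALLOWED"}:
        # Refuse to guess when the reply does not match the expected format.
        raise ValueError(f"unexpected reply: {reply!r}")
    return reply == "VIOLATES"
```

Passing the model in as a plain callable keeps the sketch independent of any particular provider and makes it easy to run the same constitution through several models.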

We evaluate on three content moderation categories (harassment, hate speech, non-violent crime) and show that the approach reduces cross-model inconsistency by up to 57x compared to paragraph definitions. Cross-model disagreement serves to diagnose specification gaps, and the human remains responsible for high-level decisions about what each category should mean rather than for individual labeling calls.
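One way to picture cross-model inconsistency is as a pairwise disagreement rate over the same items, with the disagreeing items set aside for review as possible specification gaps. The sketch below assumes that definition; the abstract does not state the exact metric, and the function names and toy labels are hypothetical, not data from the paper.

```python
from itertools import combinations
from typing import Dict, List, Sequence


def pairwise_disagreement(labels_by_model: Dict[str, Sequence[int]]) -> float:
    """Fraction of (model pair, item) comparisons where the two labels differ."""
    models = list(labels_by_model)
    lengths = {len(labels_by_model[m]) for m in models}
    if len(lengths) > 1:
        raise ValueError("every model must label the same items")
    differing = 0
    total = 0
    for a, b in combinations(models, 2):
        for x, y in zip(labels_by_model[a], labels_by_model[b]):
            differing += x != y
            total += 1
    return differing / total if total else 0.0


def disputed_items(labels_by_model: Dict[str, Sequence[int]]) -> List[int]:
    """Indices of items where the models do not all agree."""
    columns = zip(*labels_by_model.values())
    return [i for i, column in enumerate(columns) if len(set(column)) > 1]


# Toy labels for illustration only (1 = in category, 0 = not).
toy_labels = {
    "model_a": [1, 0, 1, 1],
    "model_b": [1, 1, 0, 1],
    "model_c": [0, 0, 1, 1],
}
rate = pairwise_disagreement(toy_labels)
review_queue = disputed_items(toy_labels)
```

Under this assumed metric, a reduction factor would be the disagreement rate with paragraph definitions divided by the rate with constitutions, and the review queue is what a human would inspect when deciding how to tighten the specification.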

For the safety evaluation, we introduce a dual-axis formulation scoring intent and content independently over the full conversation, so downstream consumers can act on either axis or both.
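A dual-axis label can be represented as two independent fields that a downstream consumer combines however it needs. The sketch below is illustrative only: boolean axes, the field names, the glosses in the comments, and the policy names are assumptions, since the abstract does not specify the scoring scale or a schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DualAxisLabel:
    """Conversation-level label with two independent axes (hypothetical schema)."""

    intent_unsafe: bool   # assumed meaning: the user appears to pursue a harmful goal
    content_unsafe: bool  # assumed meaning: the conversation contains harmful material


def should_flag(label: DualAxisLabel, policy: str) -> bool:
    """Combine the two axes according to a consumer-chosen policy."""
    if policy == "intent":
        return label.intent_unsafe
    if policy == "content":
        return label.content_unsafe
    if policy == "either":
        return label.intent_unsafe or label.content_unsafe
    if policy == "both":
        return label.intent_unsafe and label.content_unsafe
    raise ValueError(f"unknown policy: {policy!r}")


# Example: harmful intent expressed, but no harmful content produced.
example = DualAxisLabel(intent_unsafe=True, content_unsafe=False)
```

Keeping the axes separate means a consumer that only cares about what was actually produced can use the `content` policy, while one monitoring user behavior can use `intent`, without relabeling.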
