A Systematic Analysis of Linguistic Features in AI-Generated Text Detection Across Domains and Models

arXiv cs.CL·Yassir El Attar, Esra Dönmez, Maximilian Maurer, Agnieszka Falenska

6/4/2026


Quick Answer

This large-scale study shows that 284 interpretable linguistic features can effectively distinguish AI-generated text from human-written text across 27 LLMs and ten domains.

Quick Take

While many indicators are context-dependent, measures of lexical richness consistently serve as robust signals, enhancing interpretability for non-experts.

Key Points

Study assesses 284 linguistic features across 27 LLMs and ten text domains.
Classifiers based on linguistic features reliably distinguish AI and human text.
Lexical richness measures are robust across different models and domains.
Findings address gaps in understanding AI-generated text characteristics.
Results support more reliable analyses of AI-generated language.
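
To make "lexical richness" concrete, here is a minimal sketch of two common richness measures, type-token ratio and hapax ratio. These are illustrative assumptions: the excerpt does not list which of the 284 features the authors use, and the function names and the simple regex tokenizer below are hypothetical, not taken from the paper.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens (deliberately simple)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def type_token_ratio(text: str) -> float:
    """Share of distinct words among all words; 0.0 for empty input."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def hapax_ratio(text: str) -> float:
    """Share of tokens whose word occurs exactly once in the text."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    return sum(1 for c in counts.values() if c == 1) / len(tokens)


if __name__ == "__main__":
    sample = "the cat sat on the mat and the dog sat too"
    print(f"TTR:   {type_token_ratio(sample):.3f}")
    print(f"Hapax: {hapax_ratio(sample):.3f}")
```

Both measures are sensitive to text length, so comparing them across texts of very different sizes would likely need length normalization or fixed-size windows.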

Source Excerpt

From the original publisher, up to about 700 characters

arXiv:2606.04177v1. Abstract: Interpretable linguistic features offer a promising approach for explaining why a given text appears machine-generated, particularly for non-expert users. However, existing findings on which features reliably indicate LLM-generated text remain fragmented across feature sets, models, and text domains. To address this gap, we conduct a large-scale empirical study assessing the robustness of linguistic signals for characterizing AI-generated text.

Our analysis covers 284 interpretable linguistic features across outputs from 27 LLMs and ten text domains under cross-model and cross-domain generalization settings. …


