DetectRL-X: Towards Reliable Multilingual and Real-World LLM-Generated Text Detection
Quick Take
DetectRL-X benchmarks multilingual LLM-generated text detection across diverse real-world scenarios.
Key Points
- Evaluates detectors across 8 dimensions in 8 languages.
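- Simulates real-world usage with texts from 4 commercial LLMs and AI-assisted writing operations (polishing, expanding, condensing).
- Analyzes how domains, generators, attack strategies, text length, and refinement operations affect detector performance in each language.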
Authors: Junchao Wu, Yefeng Liu, Chenyu Zhu, Hao Zhang, Zeyu Wu, Tianqi Shi, Yichao Du, Longyue Wang, Weihua Luo, Jinsong Su, Derek F. Wong
Abstract: The effective detection and governance of Large Language Model (LLM) generated content have become increasingly critical due to the growing risk of misuse. Despite the impressive performance of existing detectors, their reliability and potential in multilingual, real-world scenarios remain largely underexplored. In this study, we introduce DetectRL-X, a comprehensive multilingual benchmark designed to evaluate advanced detectors across 8 dimensions. The benchmark encompasses 8 languages commonly used in commercial contexts and collects human-written texts from 6 domains highly susceptible to LLM misuse. To better align with real-world applications, we create LLM-generated texts using 4 popular commercial LLMs and include typical AI-assisted writing operations such as polishing, expanding, and condensing to capture authentic usage patterns. Furthermore, we develop a multilingual framework for paraphrasing and perturbation attacks to simulate diverse human modifications and writing noise, enabling stress testing of detectors across languages. Experimental results on DetectRL-X reveal the strengths and limitations of current state-of-the-art detectors when applied to diverse linguistic resources. We further analyze how domains, generators, attack strategies, text length, and refinement operations influence performance in different languages, underscoring DetectRL-X as an effective benchmark for strengthening multilingual and language-specific detectors.
| Comments: | ACL 2026 Main |
| Subjects: | Computation and Language (cs.CL) |
| Cite as: | arXiv:2605.15518 [cs.CL] (or arXiv:2605.15518v1 [cs.CL] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.15518 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Junchao Wu
[v1] Fri, 15 May 2026 01:29:26 UTC (10,065 KB)
— Originally published at arxiv.org