BOUTEF: A Multilingual Corpus for FakeNews in North Africa -- Language as a Weapon
Quick Take
The BOUTEF corpus addresses the challenge of fake news in North Africa, specifically Algeria and Tunisia, by providing a multilingual dataset of fake and genuine narratives, user comments, and verified debunking information. The resource supports analysis of thematic distributions and engagement dynamics. The authors report that fake news relies on emotionally charged narratives and sensational framing that enhance virality, while debunking content is more factual and verification-oriented. The study also highlights the role of informal language practices in the spread of misinformation.
Key Points
- The BOUTEF corpus includes fake narratives, genuine narratives, user comments, and verified debunking information.
- Analyzes thematic distributions and sentiment patterns in fake news.
- Findings show emotional narratives enhance fake news virality.
- Debunking content is more factual and verification-oriented.
- Study highlights country-specific dynamics in Algeria and Tunisia.
Article Content
From source RSS / original summary
arXiv:2606.00193v1 Announce Type: new
Abstract: The rapid spread of fake news on social media has become a major challenge, particularly in multilingual and under-resourced contexts such as North Africa. In this paper, we introduce BOUTEF, a large-scale multilingual corpus designed to study the propagation, characteristics, and impact of fake news in Algeria and Tunisia.
The corpus integrates three complementary components: fake narratives, genuine narratives, and associated user-generated comments, along with verified debunking information. It covers a wide range of languages and linguistic varieties, including MSA, Algerian and Tunisian dialects, Arabizi, French, English, and code-switched language. Building on this resource, we conduct a comprehensive empirical analysis combining quantitative and qualitative approaches.
We examine thematic distributions, linguistic and rhetorical strategies, sentiment patterns, and social engagement dynamics. Statistical analyses reveal significant associations between thematic categories and message veracity, as well as strong correlations between user engagement and the visibility of fake content. Our findings show that fake news relies heavily on emotionally charged narratives, sensational framing, and hybrid linguistic practices that enhance virality and audience engagement.
In contrast, debunking content adopts a more factual and verification-oriented style. Furthermore, a comparative analysis between Algeria and Tunisia highlights both shared dynamics and country-specific characteristics shaped by sociopolitical contexts. The results emphasize the role of informal language practices in the diffusion and reception of misinformation.
By providing a rich, annotated, and publicly available dataset, this work contributes to advancing research on fake news detection, low-resource language processing, and the understanding of information disorders in complex linguistic environments.
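The abstract reports significant associations between thematic categories and message veracity but does not name the test used. One common way to check such an association is a chi-square test of independence on a theme-by-veracity contingency table. The sketch below is an illustration under that assumption, not the authors' method; the function name `chi_square_independence`, the theme labels, and all counts are hypothetical and are not taken from BOUTEF.

```python
from math import sqrt


def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom, Cramér's V)
    for a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if theme and veracity were independent
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    n_rows, n_cols = len(table), len(col_totals)
    dof = (n_rows - 1) * (n_cols - 1)
    # Cramér's V rescales chi-square to an effect size between 0 and 1
    cramers_v = sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))
    return chi2, dof, cramers_v


# Hypothetical counts (NOT from the BOUTEF corpus).
# Rows are themes, columns are (fake, genuine).
counts = [
    [30, 10],  # hypothetical theme A
    [10, 30],  # hypothetical theme B
]

chi2, dof, v = chi_square_independence(counts)
print(f"chi2={chi2:.2f}, dof={dof}, Cramér's V={v:.2f}")
```

The statistic is compared against a chi-square distribution with `dof` degrees of freedom to obtain a p-value, and Cramér's V indicates how strong the association is regardless of sample size.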