MentalMARBERT: Domain-Adaptive Pre-training and Two-Stage Fine-Tuning for Arabic Mental Health Disorders Detection

arXiv cs.CL·Fatimah Almalki, Areej Alhothali, Lulwah Alharigy, Abdulrahman Aladeem


~2 min read · 6/12/2026 · en

Quick Answer

The study introduces MentalMARBERT, a domain-adapted model for Arabic mental health disorder detection, achieving a macro-F1 score of 0.861 and accuracy of 0.877.

Quick Take

The two-phase framework first adapts candidate Arabic backbones with DAPT and TAPT, then compares classification architectures and fine-tuning strategies on the best one. The resulting model outperforms baseline models with statistically significant gains. A novel dataset of 50,670 annotated tweets across six categories supports the research.

Key Points

MentalMARBERT achieves statistically significant improvements over baseline models in both accuracy and macro-F1.
The model was evaluated using a novel dataset of 50,670 annotated Arabic tweets.
Hierarchical two-stage architecture combined with full fine-tuning yielded the best performance.
The study addresses challenges like dialectal variation and class imbalance in Arabic NLP.
Strong inter-annotator agreement was achieved with Krippendorff's Alpha of 0.733.


Article Content

From source RSS / original summary

arXiv:2606.12649v1. Abstract: Detecting mental health disorders from Arabic social media text remains challenging due to dialectal variation, informal language, limited high-quality annotated resources, and severe class imbalance. While English mental health natural language processing (NLP) has progressed substantially, Arabic multi-class disorder classification remains insufficiently studied. This study proposes a two-phase framework for Arabic mental health text classification.

In phase 1, three Arabic pre-trained language models, AraBERT, CAMeLBERT, and MARBERT, undergo Domain-Adaptive and Task-Adaptive Pretraining (DAPT and TAPT) using a large-scale corpus of unlabeled Arabic mental health tweets. The adapted models are evaluated under a unified protocol to identify the most effective backbone model.
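The abstract does not state which training objective is used for DAPT and TAPT. For BERT-style encoders such as MARBERT, continued masked language modeling on unlabeled in-domain text is the usual choice, so the sketch below assumes it. The function name `mask_tokens`, the 15% masking rate, the 80/10/10 replacement split, and the placeholder tokens are all illustrative assumptions, not details from the paper.

```python
import random


def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15, seed=None):
    """BERT-style masking for one tokenized sentence (illustrative sketch).

    Returns (inputs, labels). labels[i] holds the original token at the
    positions selected for prediction and None everywhere else.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover this token
            r = rng.random()
            if r < 0.8:
                inputs.append(mask_token)         # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                inputs.append(tok)                # 10%: keep the original token
        else:
            labels.append(None)  # position is ignored by the loss
            inputs.append(tok)
    return inputs, labels


# Placeholder tokens standing in for a tokenized in-domain tweet.
sentence = ["w1", "w2", "w3", "w4", "w5", "w6", "w7", "w8"]
vocabulary = ["w1", "w2", "w3", "w4", "w5", "w6", "w7", "w8", "w9"]
masked_inputs, targets = mask_tokens(sentence, vocabulary, seed=0)
```

Under this assumption, DAPT and TAPT would differ only in the corpus fed to the masking step: a broad in-domain corpus for DAPT and the task's own unlabeled text for TAPT.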

In phase 2, the selected model is assessed across four configurations combining single-stage and hierarchical two-stage classification architectures with full fine-tuning and Low-Rank Adaptation (LoRA). To support this study, we constructed a novel annotated Arabic mental health dataset comprising 50,670 tweets across six categories, with strong inter-annotator agreement (Krippendorff's Alpha = 0.733, average pairwise agreement = 0.797).
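To make the contrast between full fine-tuning and LoRA concrete, here is a minimal pure-Python sketch of a single LoRA-adapted linear layer: the pre-trained weight W stays frozen and only the low-rank factors A and B are trained, giving an effective weight of W + (alpha / r) * B A. The class name `LoRALinear`, the rank, the alpha value, and the layer size are illustrative assumptions; the abstract does not report the LoRA settings used in the study.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W (d_out x d_in) plus a trainable low-rank update B @ A."""

    def __init__(self, d_in, d_out, rank, alpha, seed=0):
        rng = random.Random(seed)
        # Stand-in for a frozen pre-trained weight (random, for illustration only).
        self.W = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        # A starts small and random; B starts at zero, so the update is zero at step 0.
        self.A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)                    # frozen path
        update = matvec(self.B, matvec(self.A, x))  # low-rank trainable path
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

    def full_params(self):
        return len(self.W) * len(self.W[0])


# Illustrative sizes: a 768 x 768 projection with rank 8.
layer = LoRALinear(d_in=768, d_out=768, rank=8, alpha=16)
print(layer.trainable_params())  # 12288 parameters trained under LoRA
print(layer.full_params())       # 589824 parameters trained under full fine-tuning
```

With these illustrative sizes, LoRA trains about 2% of the parameters of the layer, which is the trade-off the four phase 2 configurations weigh against the accuracy of full fine-tuning.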

Experimental results show that the domain-adapted MARBERT (MentalMARBERT) achieves statistically significant improvements over baseline models in both accuracy and macro-F1. The hierarchical two-stage architecture combined with full fine-tuning achieves the best overall performance, reaching a macro-F1 of 0.861 and an accuracy of 0.877. These findings demonstrate the effectiveness of domain-specific adaptive pretraining and hierarchical classification for Arabic mental health disorder detection.
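Because the dataset is described as severely imbalanced, it helps to see why macro-F1 is reported alongside accuracy. The sketch below computes both from scratch on a toy example; the labels and predictions are invented for illustration and are not data from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so every class counts equally."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy imbalanced example: class "a" dominates and class "b" is never predicted.
y_true = ["a", "a", "a", "a", "b", "c"]
y_pred = ["a", "a", "a", "a", "a", "c"]
print(round(accuracy(y_true, y_pred), 3))  # 0.833
print(round(macro_f1(y_true, y_pred), 3))  # 0.63
```

Accuracy stays high in the toy case because the majority class is predicted well, while macro-F1 drops because the missed minority class contributes an F1 of zero with equal weight.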
