Cross-Platform Chinese Offensive Comment Detection via Dual-Threshold Hard Example Mining
Quick Answer
This paper introduces a dual-threshold hard example mining method to enhance cross-platform offensive comment detection in Chinese social media.
Quick Take
The paper first fine-tunes the clean-Chinese-base RoBERTa model on COLD as a binary baseline, then measures how that baseline degrades on a three-class test set covering Weibo, Xiaohongshu, Tieba, and Zhihu. A second round of fine-tuning on a small set of manually labeled hard examples, selected from unlabeled corpora with two confidence thresholds, yields the reported performance gains on all four platforms at low labeling cost.
Key Points
- Introduces dual-threshold hard example mining for offensive comment detection.
- Fine-tunes clean-Chinese-base RoBERTa on COLD as a binary baseline and evaluates it on a three-class fine-labeled test set.
- Quantifies domain distances using Jaccard and Proxy-A Distance metrics.
- Achieves significant performance gains across Weibo, Xiaohongshu, Tieba, and Zhihu.
- Requires minimal manual labeling for effective cross-platform adaptation.
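The two domain-distance measures named above can be illustrated with a minimal sketch. The summary does not say how the paper tokenizes text or trains its domain classifier, so the character-bigram vocabulary and the function names here are assumptions for illustration only. The Proxy-A Distance formula, 2(1 - 2ε) with ε the error of a classifier trained to tell source from target samples, is the standard definition.

```python
from typing import Iterable, Set


def char_ngrams(text: str, n: int = 2) -> Set[str]:
    """Return the set of character n-grams in one comment (assumed tokenization)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_distance(corpus_a: Iterable[str], corpus_b: Iterable[str], n: int = 2) -> float:
    """1 - |A ∩ B| / |A ∪ B| over the n-gram vocabularies of two corpora."""
    vocab_a = set().union(*(char_ngrams(t, n) for t in corpus_a))
    vocab_b = set().union(*(char_ngrams(t, n) for t in corpus_b))
    union = vocab_a | vocab_b
    if not union:
        return 0.0  # two empty vocabularies are treated as identical
    return 1.0 - len(vocab_a & vocab_b) / len(union)


def proxy_a_distance(domain_classifier_error: float) -> float:
    """Standard Proxy-A Distance: 2 * (1 - 2 * error).

    The error is the held-out error of a classifier trained to separate
    source-domain samples from target-domain samples.
    """
    return 2.0 * (1.0 - 2.0 * domain_classifier_error)
```

A domain classifier that cannot beat chance (error 0.5) gives a distance of 0, while a perfect one (error 0) gives the maximum of 2, so larger values indicate a target platform that is easier to tell apart from the source.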
Abstract: Cross-platform deployment of offensive comment detection for Chinese social media suffers from performance degradation. The paper proposes a dual-threshold hard example mining method to address this. First, the clean-Chinese-base RoBERTa is fine-tuned on COLD to establish a binary baseline for fair comparison. Second, a three-class, finely labeled test set covering Weibo, Xiaohongshu, Tieba, and Zhihu is constructed; domain distances from the source are quantified using Jaccard and Proxy-A Distance; and the degradation bottleneck of the baseline under domain shift is systematically analyzed. Third, a dual-threshold hard example mining strategy filters high- and low-confidence error-prone samples from unlabeled corpora by prediction confidence. The model is then fine-tuned a second time under implicit contexts with only a small set of manually labeled hard examples, achieving low-cost cross-platform domain adaptation. Experiments show significant performance gains for the optimized model across the four platforms.
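The selection step described in the abstract can be sketched roughly as follows. The abstract gives neither the threshold values nor the exact selection rule, so the names `tau_low`, `tau_high`, and `budget`, the default values, and the use of the maximum class probability as "confidence" are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Optional, Sequence, Tuple


def mine_hard_examples(
    probs: Sequence[Sequence[float]],
    tau_low: float = 0.6,    # hypothetical threshold: at or below this, the model is unsure
    tau_high: float = 0.95,  # hypothetical threshold: at or above this, the model is very sure
    budget: Optional[int] = None,
) -> Tuple[List[int], List[int]]:
    """Split unlabeled samples into low- and high-confidence candidate pools.

    probs[i] holds the class probabilities the baseline predicts for sample i.
    Returns (low_pool, high_pool) as lists of sample indices to send for
    manual labeling; samples between the two thresholds are skipped.
    """
    low: List[Tuple[float, int]] = []
    high: List[Tuple[float, int]] = []
    for idx, p in enumerate(probs):
        confidence = max(p)
        if confidence <= tau_low:
            low.append((confidence, idx))
        elif confidence >= tau_high:
            high.append((confidence, idx))
    low.sort()               # least confident first
    high.sort(reverse=True)  # most confident first
    low_pool = [idx for _, idx in low]
    high_pool = [idx for _, idx in high]
    if budget is not None:
        low_pool, high_pool = low_pool[:budget], high_pool[:budget]
    return low_pool, high_pool


# Toy three-class predictions for four unlabeled comments.
predictions = [
    [0.50, 0.30, 0.20],
    [0.98, 0.01, 0.01],
    [0.80, 0.10, 0.10],
    [0.40, 0.35, 0.25],
]
low_pool, high_pool = mine_hard_examples(predictions)
```

In this sketch the low-confidence pool captures comments the baseline is unsure about, and the high-confidence pool captures comments it is very sure about, where any mistakes would otherwise go unnoticed. Annotators would label both pools, and the labeled examples would feed the second fine-tuning round.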
| Comments: | 10 pages, 7 figures |
| Subjects: | Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Systems and Control (eess.SY) |
| MSC classes: | 68T50, 68U15, 91F10 |
| ACM classes: | I.2.7; I.2.6; H.3.4 |
| Cite as: | arXiv:2606.27629 [cs.CL] (or arXiv:2606.27629v1 [cs.CL] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2606.27629 (arXiv-issued DOI via DataCite) |
Submission history
From: Junhui Zhao
[v1] Fri, 26 Jun 2026 00:56:11 UTC (583 KB)
— Originally published at arxiv.org