CSRP: Chain-of-Thought Reasoning for Chinese Text Correction via Reinforcement Learning with Efficiency-Aware Rewards
Quick Take
CSRP is a three-stage framework for Chinese Grammatical Error Correction (CGEC) that reaches state-of-the-art results on the NACGEC benchmark: 50.99 F0.5 and 57.17 precision. It mitigates the over-correction bias seen in MLE-trained models and, on CSCD spelling correction, scores 59.61 F1, 5.20 points above GPT-4.
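For readers unfamiliar with the F0.5 score reported above: it is the standard F-beta measure with beta = 0.5, which weights precision more heavily than recall, so a system that makes fewer but more reliable edits is favored. The sketch below is a minimal illustration of that formula; the function name and the two example systems are hypothetical and are not figures from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Weighted harmonic mean of precision and recall.

    With beta < 1, precision counts more than recall.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical systems for illustration only (not results from the paper):
cautious = f_beta(precision=0.60, recall=0.30)  # few edits, mostly correct
eager = f_beta(precision=0.30, recall=0.60)     # many edits, many wrong
print(round(cautious, 4), round(eager, 4))      # 0.5 0.3333
```

The two systems would tie under F1, but F0.5 ranks the cautious one higher, which is why over-correction is costly under this metric.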
Key Points
- CSRP utilizes Continual Pre-training on 5.9M samples to enhance domain knowledge.
- Chain-of-Thought SFT improves diagnostic transparency through explicit error reasoning.
- Group Relative Policy Optimization employs Efficiency-Aware Rewards to minimize unnecessary edits.
- The RL alignment stage contributes an 8% relative gain over the SFT baseline in grammatical error correction, independent of the gain from large-scale CPT.
- The method advances CSCD spelling correction performance to 59.61 F1.
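The summary does not give the formula for the Efficiency-Aware Reward, only that it penalizes unnecessary edits. The sketch below is one hypothetical way such a reward could look, not the authors' definition: it extracts character-level edits with Python's `difflib`, credits edits that match the reference, and subtracts a penalty for each edit the reference does not contain. The function names, the `penalty` weight, and the example sentences are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def char_edits(source: str, target: str) -> set:
    """Character-level edits that turn source into target, as hashable tuples."""
    matcher = SequenceMatcher(a=source, b=target, autojunk=False)
    return {
        (tag, i1, i2, target[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    }


def efficiency_aware_reward(source: str, hypothesis: str, reference: str,
                            penalty: float = 0.5) -> float:
    """Hypothetical reward: reference coverage minus a cost per extra edit."""
    hyp = char_edits(source, hypothesis)
    ref = char_edits(source, reference)
    if not hyp and not ref:
        return 1.0  # a clean sentence was correctly left untouched
    correct = len(hyp & ref)      # edits that the reference also makes
    unnecessary = len(hyp - ref)  # edits that the reference does not make
    coverage = correct / len(ref) if ref else 0.0
    return coverage - penalty * unnecessary


source = "我今天很高心。"       # contains one spelling error (心 should be 兴)
reference = "我今天很高兴。"
minimal = efficiency_aware_reward(source, "我今天很高兴。", reference)
over_edited = efficiency_aware_reward(source, "我今日很高兴。", reference)
print(minimal, over_edited)  # 1.0 0.5
```

Both hypotheses fix the real error, but the second also rewrites 今天 as 今日, so it scores lower under this toy reward.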
Article Content
From source RSS / original summary: arXiv:2606.00020v1. Announce Type: new. Abstract: Large Language Model (LLM) based Chinese Grammatical Error Correction (CGEC) systems face two critical challenges: general-purpose models lack specialized linguistic priors for subtle grammatical distinctions, and Supervised Fine-Tuning (SFT) with Maximum Likelihood Estimation fails to optimize for precision-focused metrics, leading to systematic over-correction.
We propose CSRP, a three-stage framework that progressively builds correction capability through Continual Pre-training (CPT) on 5.9M balanced samples to internalize domain knowledge, Chain-of-Thought SFT with explicit error reasoning for diagnostic transparency, and Group Relative Policy Optimization with a novel Efficiency-Aware Reward that explicitly penalizes unnecessary edits.
On the NACGEC benchmark, CSRP achieves state-of-the-art performance with 50.99 $F_{0.5}$ and 57.17 precision, substantially outperforming previous best results while effectively mitigating the over-correction bias inherent in MLE-trained models. Our method also advances CSCD spelling correction to 59.61 F1, surpassing GPT-4 by 5.20 points.
Comprehensive ablation studies demonstrate that the RL alignment stage contributes an 8% relative gain over the SFT baseline, and that this gain is orthogonal to the contribution of large-scale CPT, validating that explicit optimization for edit efficiency is essential for high-quality grammatical error correction. Our code is available at https://github.com/TW-NLP/ChineseErrorCorrector.
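As background on the RL stage named in the abstract: Group Relative Policy Optimization samples a group of candidate outputs for each prompt and scores every candidate relative to the rest of its group, instead of relying on a learned value model. The sketch below shows only that normalization step, under stated assumptions: the function name and reward values are hypothetical, the population standard deviation is one common choice, and nothing here reflects the paper's actual training configuration.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mean = fmean(rewards)
    std = pstdev(rewards)
    # eps keeps the division defined when every candidate gets the same reward
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four candidate corrections of one input sentence
rewards = [1.0, 0.5, 0.0, 0.5]
advantages = group_relative_advantages(rewards)
for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.3f}")
```

Candidates scoring above the group mean receive a positive advantage and are reinforced, while those below are discouraged; if every candidate scores the same, all advantages are zero and the group provides no learning signal.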