FormalASR: End-to-End Spoken Chinese to Formal Text

arXiv cs.CL·Wanyi Ning, Yinshang Guo, Haitao Qian, Jiyuan Cheng, Weiyuan Feng, Yufei Zhang

Quick Take

FormalASR transcribes spoken Chinese directly into formal written text with a single end-to-end model, so no post-processing LLM is needed at deployment time.

Key Points

Two compact models: 0.6B and 1.7B parameters.
Achieves up to 37.4% relative CER reduction over verbatim baselines, while also improving ROUGE-L and BERTScore.
Lightweight solution for on-device transcription.

[Submitted on 19 May 2026]

Abstract: Automatic speech recognition (ASR) systems are typically optimized for verbatim transcription, which preserves disfluencies, filler words, and informal spoken structures that are often unsuitable for downstream writing-oriented applications. A common workaround is a two-stage ASR+LLM pipeline for post-editing, but this design increases latency and memory cost and is difficult to deploy on-device. We present FormalASR, two compact end-to-end models (0.6B and 1.7B) that directly transcribe spoken Chinese into formal written text. To enable this setting, we build WenetSpeech-Formal and Speechio-Formal, two large-scale spoken-to-formal datasets constructed by LLM-based rewriting and quality filtering. We then fine-tune Qwen3-ASR at two scales (0.6B and 1.7B) with supervised fine-tuning. Experiments on WenetSpeech-Formal and Speechio-Formal show that FormalASR achieves up to 37.4% relative CER reduction over verbatim baselines, while also improving ROUGE-L and BERTScore. FormalASR requires no post-processing LLM at deployment time, providing a lightweight, on-device solution for spoken-to-formal transcription.
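
The headline metric can be made concrete with a small sketch. Character error rate (CER) is conventionally the character-level edit distance between hypothesis and reference, divided by the reference length, and a relative reduction compares two CER values against the baseline. The Python below illustrates those standard definitions only; it is not the authors' evaluation code, and the example sentences are invented for illustration rather than taken from WenetSpeech-Formal or Speechio-Formal. As a purely hypothetical reading of the number, a 37.4% relative reduction would bring a baseline CER of 20% down to roughly 12.5%.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between ref and hyp."""
    prev = list(range(len(hyp) + 1))  # distances from the empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i ref characters into an empty hyp costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, system: float) -> float:
    """Fraction by which `system` lowers the error rate relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - system) / baseline


# Invented example: the reference is the formal written form of the utterance.
formal_ref = "我们明天开会"
verbatim_hyp = "嗯我们那个明天开会"  # verbatim output keeps the filler words
formal_hyp = "我们明天开会吧"        # formal-style output with one extra character

cer_verbatim = cer(formal_ref, verbatim_hyp)
cer_formal = cer(formal_ref, formal_hyp)
print(f"verbatim CER: {cer_verbatim:.3f}")  # 0.500
print(f"formal CER:   {cer_formal:.3f}")    # 0.167
print(f"relative reduction: {relative_reduction(cer_verbatim, cer_formal):.3f}")  # 0.667
```

Scoring a verbatim transcript against a formal reference penalizes every retained filler word as an insertion, which is one reason a model trained to emit formal text directly can show a large relative CER gain on such references.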

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2605.19266 [cs.CL]
	(or arXiv:2605.19266v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2605.19266 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Wanyi Ning
[v1] Tue, 19 May 2026 02:27:27 UTC (1,244 KB)

— Originally published at arxiv.org
