Efficient Punctuation Restoration via Weighted Lookahead Scoring Method for Streaming ASR Systems
Quick Take
This paper introduces a non-autoregressive scoring method for punctuation restoration in streaming ASR systems. On the IWSLT 2017 benchmark it reaches a 4-class macro F1 of 0.893 without fine-tuning and 0.937 after fine-tuning, both with K=2. The method uses a bounded K-subword-token lookahead to make an incremental punctuation decision at each word boundary. Both settings beat the prompt-based baseline (0.566), and the fine-tuned variant also beats a fine-tuned ELECTRA baseline (0.913) under the same lookahead budget.
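For reference, macro F1 is the unweighted mean of the per-class F1 scores, so rare punctuation classes count as much as frequent ones. The definition below is the standard one; the excerpt does not list the four classes, so the class set C is left as a placeholder rather than filled in.

```latex
\[
P_c = \frac{TP_c}{TP_c + FP_c}, \qquad
R_c = \frac{TP_c}{TP_c + FN_c}, \qquad
F1_c = \frac{2 P_c R_c}{P_c + R_c}
\]
\[
\text{macro-F1} = \frac{1}{|C|} \sum_{c \in C} F1_c, \qquad |C| = 4
\]
```

Here TP_c, FP_c, and FN_c are the true-positive, false-positive, and false-negative counts for class c, taken over word boundaries.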
Key Points
- Achieved a macro F1 score of 0.893 without fine-tuning on IWSLT 2017.
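- Fine-tuning raised the score to 0.937, ahead of the prompt-based baseline (0.566) and a fine-tuned ELECTRA baseline (0.913).
- Uses a bounded K-subword-token lookahead to make a decision at each word boundary.
- Scores insertion hypotheses instead of generating text, so the input transcript is preserved and the latency and alignment failures of generation-based approaches are avoided.
- Decisions are calibrated with a weight α and a validation-calibrated threshold τ; no parameter updates occur during inference.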
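The bounded lookahead in the points above can be pictured as a small buffer: the decision for the boundary after a word is held back until K subword tokens of future context have arrived. The sketch below is only an illustration of that buffering idea, not the paper's implementation. The function name `stream_boundaries`, the pre-tokenized input format, and the flush-at-end behavior are all assumptions.

```python
from collections import deque
from typing import Iterable, Iterator, List, Tuple


def stream_boundaries(
    words: Iterable[List[str]], k: int
) -> Iterator[Tuple[int, List[str]]]:
    """Yield (word_index, lookahead_tokens) for each word boundary.

    `words` is a stream where each item is one word already split into
    subword tokens (the tokenizer is assumed and not shown). A boundary
    is released as soon as k future subword tokens are available.
    """
    pending = deque()  # entries: [word_index, lookahead tokens seen so far]
    for index, pieces in enumerate(words):
        # The new word is future context for every undecided boundary.
        for entry in pending:
            need = k - len(entry[1])
            entry[1].extend(pieces[:need])
        pending.append([index, []])
        # Older boundaries fill first, so checking the head is enough.
        while pending and len(pending[0][1]) >= k:
            i, context = pending.popleft()
            yield i, context
    # Assumed behavior at end of stream: decide with whatever context exists.
    while pending:
        i, context = pending.popleft()
        yield i, context


if __name__ == "__main__":
    demo = [["hel", "lo"], ["world"], ["how"], ["are"], ["you"]]
    for boundary, lookahead in stream_boundaries(demo, k=2):
        print(boundary, lookahead)
```

With K=2, the boundary after word 0 is released only once two later subword tokens have been seen, which makes the latency cost of a larger K easy to inspect.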
Article Excerpt
From source RSS / original summary
arXiv:2606.05179v1 Announce Type: new
Abstract: Punctuation restoration improves ASR (Automatic Speech Recognition) readability. However, streaming ASR requires online decisions with limited future context. In streaming ASR, the system predicts punctuation incrementally, which makes generation-based approaches prone to latency and alignment failures under boundary-wise evaluation.
This paper proposes a non-autoregressive scoring method (no free-form generation) that preserves the input transcript and makes a decision at each word boundary. Our method compares punctuation insertion hypotheses against a no-insertion baseline under a bounded K-subword-token lookahead, and calibrates decisions using a weight α and a validation-calibrated threshold τ (no parameter updates during inference).
On IWSLT 2017, our scoring method achieves a 4-class macro F1 of 0.893 in the no fine-tuning setting (validation-calibrated, K=2) and 0.937 after fine-tuning (K=2), outperforming the prompt-based baseline (0.566) and a fine-tuned ELECTRA baseline (0.913) under the same lookahead budget. We analyze the impact of the lookahead budget through ablation studies on K.
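The abstract does not give the exact scoring formula, so the following is a minimal sketch under stated assumptions. Each hypothesis (one per punctuation mark, plus the no-insertion baseline) is assumed to carry two precomputed log-probabilities: one for the hypothesis itself and one for the K lookahead tokens that follow it. The combination `local + alpha * lookahead`, the margin test against τ, and the function name `decide` are all hypothetical choices made for illustration.

```python
from typing import Dict, Tuple

NO_INSERTION = ""  # key for the no-insertion baseline (assumed convention)


def decide(
    hypotheses: Dict[str, Tuple[float, float]],
    alpha: float,
    tau: float,
) -> str:
    """Pick a punctuation mark for one word boundary, or "" for none.

    `hypotheses` maps each candidate mark to a pair
    (local_log_prob, lookahead_log_prob). How those scores are obtained
    from a language model is outside this sketch.
    """
    # Assumed weighting: the lookahead term is scaled by alpha.
    combined = {
        mark: local + alpha * lookahead
        for mark, (local, lookahead) in hypotheses.items()
    }
    baseline = combined[NO_INSERTION]
    insertions = {m: s for m, s in combined.items() if m != NO_INSERTION}
    best_mark = max(insertions, key=insertions.get)
    # Insert only if the best hypothesis beats the baseline by more than tau.
    margin = insertions[best_mark] - baseline
    return best_mark if margin > tau else NO_INSERTION


if __name__ == "__main__":
    scores = {"": (-0.5, -4.0), ",": (-1.2, -2.0), ".": (-2.5, -1.0)}
    print(repr(decide(scores, alpha=0.5, tau=0.2)))
```

In this sketch α and τ are fixed numbers chosen beforehand (for example on a validation set), which matches the abstract's note that no parameters are updated during inference.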