Montreal Forced Aligner and the state of speech-to-text alignment in 2026
Quick Answer
The Montreal Forced Aligner (MFA) 3.0, documented in a 2026 arXiv paper, achieves state-of-the-art or near state-of-the-art accuracy against classic and neural forced aligners, with mean boundary errors below 15 ms on all four benchmark datasets covering English, Japanese, and Korean.
Quick Take
The paper documents MFA's development from version 1.0 to 3.0: expanded language and dialect coverage built on larger open-source datasets, harmonized IPA dictionaries, model adaptation, cross-language phone remapping, and support utilities. MFA was first released in 2016, and the abstract describes it as the most widely used forced alignment tool in research and industry.
Key Points
- MFA 3.0 reaches state-of-the-art or near state-of-the-art performance on all four benchmark datasets.
- Mean boundary errors are consistently below 15 ms for evaluated languages.
- Adaptation techniques improve performance for languages outside the training distribution.
- Cross-language phone remapping is likewise effective for languages outside MFA's training distribution.
- Pronunciation probability modeling and phonological rules provide gains in specific conditions.
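The "mean boundary error" figure quoted above can be illustrated with a minimal sketch. It assumes the metric is the mean absolute difference between reference and predicted boundary times that are already paired one to one; the summary does not spell out the paper's exact evaluation protocol, so the function name and the sample timestamps below are hypothetical.

```python
from statistics import mean


def mean_boundary_error_ms(reference, predicted):
    """Mean absolute difference, in milliseconds, between paired boundary times.

    Both inputs are sequences of boundary times in seconds. Assumption: the
    i-th reference boundary corresponds to the i-th predicted boundary.
    """
    if len(reference) != len(predicted):
        raise ValueError("boundary lists must have the same length")
    if not reference:
        raise ValueError("need at least one boundary")
    return mean(abs(r - p) for r, p in zip(reference, predicted)) * 1000.0


# Hypothetical boundary times in seconds (illustrative only, not from the paper).
reference = [0.120, 0.345, 0.610, 0.905]
predicted = [0.128, 0.339, 0.622, 0.900]

error_ms = mean_boundary_error_ms(reference, predicted)
print(f"mean boundary error: {error_ms:.2f} ms")
print("below 15 ms threshold:", error_ms < 15.0)
```

A real evaluation would first have to match predicted phone or word boundaries to the reference annotation, which this sketch takes as given.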
Article Excerpt
From source RSS / original summary (arXiv:2606.18466v1)
Abstract: The Montreal Forced Aligner (MFA) was released in 2016 and has since become the most widely used tool for forced alignment in research and industry. In the decade since, MFA has undergone substantial development, including expanded coverage across more languages and dialects using larger open-source datasets, harmonized IPA dictionaries, model adaptation, cross-language phone remapping, and support utilities. This paper documents MFA 3.0's developments since version 1.0 and evaluates MFA's performance across English, Japanese, and Korean, benchmarked against classic and neural forced aligners. MFA 3.0 achieves state-of-the-art or near state-of-the-art performance across all four benchmark datasets with mean boundary errors below 15 ms. Adaptation and cross-language remapping are effective for languages outside MFA's training distribution, and pronunciation probability modeling and phonological rules provide gains in specific conditions.