Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German

arXiv cs.CL·Sajjad Abdoli (MAD), Ghassan Al-Sumaidaee (MAD), Clayton W. Taylor (MAD), Ahmad (MAD), ElShiekh, Ahmed Rashad

5/20/2026


Quick Answer

This study benchmarks five commercial ASR providers on code-switching speech across four language pairs: Egyptian Arabic, Saudi Arabic, Persian, and German, each mixed with English. ElevenLabs Scribe v2 is the top performer, with a 13.2% overall WER.

Quick Take

Across all four language pairs, ElevenLabs Scribe v2 records the lowest WER (13.2% overall) and also leads on BERTScore (0.936 overall). The authors argue that BERTScore is the more reliable metric for the Arabic and Persian pairs, where transliteration variance causes WER to penalise transcriptions that are semantically correct. The benchmarking dataset is publicly available for further research.
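To make the transliteration point concrete, here is a minimal sketch of word-level WER (edit distance over whitespace tokens, divided by reference length). It is not the paper's evaluation code, and the two sentences are invented for illustration rather than taken from the dataset. The hypothesis writes the English word "meeting" in Arabic script, so an exact-match metric counts it as a substitution even though the meaning is unchanged.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                 # deletion
                curr[j - 1] + 1,             # insertion
                prev[j - 1] + substitution,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Invented example: same sentence, but "meeting" is transliterated.
reference = "أنا عايز meeting بكرة"
hypothesis = "أنا عايز ميتنج بكرة"
wer = word_error_rate(reference, hypothesis)
```

Here `wer` is 0.25: one of four words differs on the surface. An embedding-based metric such as BERTScore compares token representations instead of exact strings, which is why the authors consider it better suited to these pairs.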

Key Points

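ElevenLabs Scribe v2 achieved the lowest WER on all four language pairs: 13.2% overall and 13.1% on Egyptian Arabic.
The authors argue that BERTScore is more reliable than WER for the Arabic and Persian pairs, where transliteration variance inflates WER.
The benchmark covers four language pairs, with 300 samples selected for each.
A two-stage selection pipeline (a heuristic filter followed by a GPT-4o and Gemini 1.5 Pro ensemble) reduced LLM scoring costs by approximately 91% relative to exhaustive scoring.
Difficulty-stratified analysis revealed performance gaps that aggregate averages mask.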
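The two-stage selection idea can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: `heuristic_score` and `llm_score` are hypothetical callables standing in for the five structural signals and the LLM ensemble, and the saving is computed on the assumption that LLM cost scales linearly with the number of candidates scored.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def two_stage_select(
    candidates: List[T],
    heuristic_score: Callable[[T], float],  # hypothetical cheap filter
    llm_score: Callable[[T], float],        # hypothetical expensive scorer
    shortlist_size: int,
    final_size: int,
) -> Tuple[List[T], float]:
    """Return the selected items and the fraction of LLM calls avoided."""
    if not candidates:
        raise ValueError("candidates must not be empty")

    # Stage 1: rank everything with the cheap heuristic, keep the top slice.
    shortlist = sorted(candidates, key=heuristic_score, reverse=True)
    shortlist = shortlist[:shortlist_size]

    # Stage 2: only the shortlist is sent to the expensive scorer.
    scored = [(llm_score(item), item) for item in shortlist]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    selected = [item for _, item in scored[:final_size]]

    # Saving relative to scoring every candidate with the LLM.
    saving = 1 - len(shortlist) / len(candidates)
    return selected, saving


# Made-up numbers chosen only to reproduce a 91% ratio.
pool = list(range(100))
selected, saving = two_stage_select(
    pool,
    heuristic_score=lambda x: x,
    llm_score=lambda x: -x,
    shortlist_size=9,
    final_size=3,
)
```

With a pool of 100 and a shortlist of 9, the LLM scores 9% of the candidates, so the saving is about 91%. The paper's actual pool and shortlist sizes are not given in this summary.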


[Submitted on 18 May 2026]

Abstract: Code-switching -- the natural alternation between two languages within a single utterance -- represents one of the most challenging and under-studied conditions for automatic speech recognition (ASR). Existing commercial ASR benchmarks predominantly evaluate clean, monolingual audio and report a single Word Error Rate (WER) figure that tells practitioners little about real-world multilingual performance. We present a benchmark evaluating five commercial ASR providers across four language pairs: Egyptian Arabic--English, Saudi Arabic (Najdi/Hijazi)--English, Persian (Farsi)--English, and German--English. Each dataset comprises 300 samples selected by a two-stage pipeline: a heuristic filter scoring transcripts on five structural code-switching signals, followed by a GPT-4o and Gemini 1.5 Pro ensemble scoring candidates across six linguistic dimensions. This pipeline reduces LLM scoring costs by approximately 91% relative to exhaustive scoring. We evaluate the systems on both WER and BERTScore, arguing that BERTScore is a more reliable metric for Arabic and Persian pairs where transliteration variance causes WER to penalise semantically correct transcriptions. ElevenLabs Scribe v2 achieves the lowest WER across all four language pairs (13.2% overall; 13.1% on Egyptian Arabic) and leads on BERTScore (0.936 overall). We further demonstrate that difficulty-stratified analysis reveals performance gaps masked by aggregate averages, and that BERT embedding projections confirm semantic proximity between reference and hypothesis despite surface-level script differences. The benchmarking dataset is publicly available at this https URL.
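The claim that aggregate averages can mask gaps is easy to illustrate. The sketch below groups per-sample scores by a difficulty label and compares the per-bucket means with the overall mean. The labels and values are invented for illustration and are not results from the paper, and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def stratified_means(samples: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Average a per-sample score (for example WER) within each difficulty bucket."""
    buckets = defaultdict(list)
    for difficulty, score in samples:
        buckets[difficulty].append(score)
    return {difficulty: mean(scores) for difficulty, scores in buckets.items()}


# Invented (difficulty, WER) pairs, not taken from the benchmark.
toy_samples = [
    ("easy", 0.05),
    ("easy", 0.07),
    ("hard", 0.30),
    ("hard", 0.34),
]

overall = mean(score for _, score in toy_samples)
per_bucket = stratified_means(toy_samples)
```

In this toy case the overall mean is about 0.19, while the hard bucket averages about 0.32. A single headline figure would hide that the system is far weaker on the hard samples.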

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2605.19069 [cs.CL]
	(or arXiv:2605.19069v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2605.19069 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Sajjad Abdoli
[v1] Mon, 18 May 2026 19:50:44 UTC (748 KB)

Originally published at arxiv.org
