Beyond Perplexity: UTF-8 Validity in Byte-aware Language Models

arXiv cs.CL·Sangwhan Moon, Daisuke Oba, Youmi Ma, Tatsuya Hiraoka, Naoaki Okazaki

6/15/2026

Quick Answer

This paper shows that, for a 355M-parameter language model trained on 80B tokens, UTF-8 validity lags behind perplexity: validity stabilizes after 4.2B tokens, compared to 2.1B tokens for perplexity.

Quick Take

UTF-8 validity takes about twice as many training tokens as perplexity to stabilize. Reliable UTF-8 generation therefore needs its own evaluation, separate from traditional metrics such as perplexity.

Key Points

UTF-8 validity requires 4.2B tokens to stabilize, about twice the 2.1B tokens needed for perplexity.
Rare characters exhibit higher structural validity than common ones in context-free generation.
Evaluation protocols isolate UTF-8 structural validity from language modeling performance.
Reliable UTF-8 generation is a distinct capability needing separate assessment.
The study uses a balanced multilingual corpus including English, Japanese, Korean, and Chinese.
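To make "structural validity" concrete, here is a minimal Python sketch of a strict UTF-8 check on raw bytes, the kind of output a byte-level model emits. The helper name `is_valid_utf8` and the example byte strings are illustrative assumptions, not the paper's evaluation protocol.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# A Korean syllable occupies three bytes in UTF-8.
full = "한".encode("utf-8")
truncated = full[:2]   # the final continuation byte is missing
stray = b"\x80"        # a continuation byte with no lead byte before it

print(is_valid_utf8(full))       # True
print(is_valid_utf8(truncated))  # False
print(is_valid_utf8(stray))      # False
```

A model that stops one byte short of a multi-byte character, or emits a continuation byte out of place, fails this check even if its perplexity on the same text is low.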


Article Excerpt

From source RSS / original summary

arXiv:2606.14122v1. Abstract: Byte-level tokenization enables language models to handle any Unicode input, but models can generate invalid UTF-8 sequences when encountering rare or unseen characters. We investigate the relationship between training scale and UTF-8 generation reliability with a 355M parameter model trained on 80B tokens from a balanced multilingual corpus of English, Japanese, Korean, and Chinese.

We introduce multiple evaluation protocols that isolate UTF-8 structural validity from language modeling. UTF-8 validity convergence lags perplexity by roughly a factor of two: perplexity stabilizes after 2.1B tokens, but UTF-8 validity requires 4.2B tokens. In context-free generation, rare characters achieve higher structural validity than common characters, suggesting over-specialization of frequent character representations.

Through experiments, we observed that reliable UTF-8 generation is a distinct capability requiring evaluation beyond perplexity.
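As a rough illustration of measuring validity separately from perplexity, the sketch below computes the fraction of generated byte sequences that decode as strict UTF-8. The function name `utf8_validity_rate` and the toy samples are assumptions made for this example; the abstract does not specify the authors' protocols, and the numbers here are not results from the paper.

```python
from typing import Iterable


def utf8_validity_rate(samples: Iterable[bytes]) -> float:
    """Fraction of byte sequences that are structurally valid UTF-8.

    Hypothetical metric for illustration only.
    """
    items = list(samples)
    if not items:
        raise ValueError("need at least one sample")
    valid = 0
    for raw in items:
        try:
            raw.decode("utf-8")  # strict decoding is the default
            valid += 1
        except UnicodeDecodeError:
            pass
    return valid / len(items)


# Toy "generations": two valid strings, one truncated character, one illegal byte.
generated = [
    "hello".encode("utf-8"),
    "日本語".encode("utf-8"),
    "中文".encode("utf-8")[:-1],  # last character cut short
    b"\xff",                      # 0xFF never appears in valid UTF-8
]

print(utf8_validity_rate(generated))  # 0.5
```

Tracking a rate like this at each training checkpoint, alongside perplexity, is one simple way to see whether the two metrics converge at different points.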
