DEL: Digit Entropy Loss for Numerical Learning of Large Language Models
Quick Take
The paper introduces Digit Entropy Loss (DEL), a training objective for numerical learning in large language models (LLMs). Evaluated with four LLMs (CodeLlama, Mistral, DeepSeek, and Qwen-2.5) on seven mathematical reasoning benchmarks, DEL outperforms existing methods in both overall prediction accuracy and numerical distance. It recasts conventionally unsupervised entropy optimization as a supervised objective and covers both integers and floating-point numbers.
Key Points
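- DEL uses digit conditional probability and binary cross-entropy to turn entropy optimization into a supervised objective.
- It drops the distance term of the criterion-distance formulation to bypass the issue of numerical distance.
- The method generalizes integer-based numerical learning to floating-point numbers, covering integers, decimals, and decimal points.
- Experiments show DEL consistently outperforms Number Token Loss and Discretized Distance Loss in both prediction accuracy and numerical distance.
- Results are reported on seven mathematical reasoning benchmarks with four LLMs: CodeLlama, Mistral, DeepSeek, and Qwen-2.5.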
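To make the first key point concrete, here is a minimal sketch of what "digit conditional probability" combined with binary cross-entropy could look like. It is not the paper's exact formulation, which this summary does not give. It assumes the conditional probability is a softmax restricted to the ten digit tokens, and that the target is a one-hot digit. The function names, the toy vocabulary layout, and the logits are all hypothetical.

```python
import math

def digit_conditional_probs(logits, digit_token_ids):
    """Softmax over digit tokens only (assumed form of p(digit | token is a digit))."""
    digit_logits = [logits[i] for i in digit_token_ids]
    m = max(digit_logits)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in digit_logits]
    total = sum(exps)
    return [e / total for e in exps]

def binary_cross_entropy(probs, target_index, eps=1e-12):
    """Mean binary cross-entropy between digit probabilities and a one-hot target."""
    loss = 0.0
    for i, p in enumerate(probs):
        p = min(max(p, eps), 1.0 - eps)        # clamp to avoid log(0)
        y = 1.0 if i == target_index else 0.0
        loss += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return loss / len(probs)

# Hypothetical toy vocabulary: ids 0-1 are non-digit tokens, ids 2-11 are digits 0-9.
toy_logits = [0.5, -1.0] + [0.0] * 10
toy_logits[2 + 7] = 4.0                        # the model favours digit 7
digit_ids = list(range(2, 12))

probs = digit_conditional_probs(toy_logits, digit_ids)
loss_correct = binary_cross_entropy(probs, target_index=7)  # target matches prediction
loss_wrong = binary_cross_entropy(probs, target_index=3)    # target differs
```

In this sketch the loss depends only on how much probability mass the digit distribution places on each digit, not on how far a wrong digit is from the right one, which is one way to read "eliminating the distance term".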
Abstract: Number prediction stands as a fundamental capability of large language models (LLMs) in mathematical problem-solving and code generation. The widely adopted maximum likelihood estimation (MLE) for LLM training is not tailored to number prediction. Recently, penalty-driven approaches, e.g., Number Token Loss and Discretized Distance Loss, introduce an inductive bias of numerical distance but induce over-sharpened and over-flattened digit distributions, respectively. In this paper, we make an in-depth analysis on LLM numerical learning, and show that existing numerical learning methods conceptually follow a criterion-distance formulation, where the criterion term represents optimization pattern and the distance term instills geometric prior. Consequently, we present Digit Entropy Loss (DEL) for auto-regressive numerical learning, which reformulates the conventional unsupervised entropy optimization in three key designs: leveraging digit conditional probability and binary cross-entropy to guide the entropy optimization into a supervised manner; deprecating the distance term to bypass the issue of numerical distance; and generalizing the integer-based numerical learning to floating-point number optimization, enabling more accurate number prediction. Our DEL formulation can incorporate integers, decimals, and decimal points, expanding the learning objective from a single digit to the floating-point number domain. Experiments conducted on seven mathematical reasoning benchmarks with four representative LLMs, including CodeLlama, Mistral, DeepSeek, and Qwen-2.5, demonstrate that DEL consistently outperforms its counterparts in both overall prediction accuracy and numerical distance. Source codes are at this https URL
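The abstract's contrast between "over-sharpened" and "over-flattened" digit distributions can be illustrated with Shannon entropy, the quantity that entropy optimization acts on. The sketch below is only an illustration with made-up distributions; it does not reproduce any measurement from the paper, and the function name is hypothetical.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Made-up distributions over the ten digits 0-9.
sharp = [0.991] + [0.001] * 9   # almost all mass on one digit: low entropy
flat = [0.1] * 10               # uniform over digits: maximal entropy, log(10)

entropy_sharp = shannon_entropy(sharp)
entropy_flat = shannon_entropy(flat)
```

A sharper distribution has lower entropy and a flatter one has higher entropy, with the uniform distribution over ten digits reaching the maximum of log(10) nats.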
| Subjects: | Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) |
| Cite as: | arXiv:2605.20369 [cs.CL] (or arXiv:2605.20369v1 [cs.CL] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.20369 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Zhaohui Zheng
[v1] Tue, 19 May 2026 18:18:59 UTC (1,043 KB)
— Originally published at arxiv.org