A Geometric Profile of Semantic Information in Text: Frame-Conditional Uniqueness and a Trade-Off Triangle for Scalar Summaries
Quick Answer
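This study introduces a geometric framework for measuring semantic content in text from sentence embeddings. It pairs a three-coordinate profile (novelty, breadth, integration) with a no-go theorem showing that scalar summaries of that profile face a trade-off triangle among analytic stability, ordinal robustness, and cross-representation comparability.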
Quick Take
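The trade-off triangle comes from a no-go theorem: no scalar summary of the semantic profile can be analytically stable under paraphrase and concatenation, ordinally robust across text scales, and comparable across representations all at once. Two practical scalars, $S_{\mathrm{minmax}}$ and $S_{\mathrm{rank}}$, each occupy a distinct corner of the triangle. The recommended rank-normalized configuration passes 25 of 28 ordinal checks as point estimates (21 of 28 after Benjamini-Hochberg correction) and outperforms seven baselines. A separate result links breadth to the log-determinant of a determinantal point process (Spearman $\rho = 0.985$ over 507 Gutenberg chapters).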
Key Points
- Developed a geometric framework for semantic content measurement in text.
- Introduced a three-coordinate semantic profile: novelty, breadth, and integration.
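- Proved a no-go theorem: no scalar summary of the profile is simultaneously stable under paraphrase and concatenation, ordinally robust across text scales, and comparable across representations.
- Passed 25 of 28 ordinal checks as point estimates (21 of 28 after Benjamini-Hochberg correction) with the recommended rank-normalized configuration.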
- Connected breadth to the log-determinant of a determinantal point process.
Article Content
From source RSS / original summary: arXiv:2606.11222v1. Abstract: How much meaning does a text carry? Shannon's theory measures uncertainty over symbols and is intentionally indifferent to meaning, while pairwise metrics such as BERTScore compare two texts rather than characterizing one. We develop a geometric framework that measures semantic content from the structure of a text's sentence embeddings. The framework has three parts.
First, within a fixed embedding and baseline, six natural axioms uniquely determine a scalar measure up to scale, a frame-conditional uniqueness theorem. The resulting scalar is empirically too coarse, motivating a richer representation.
Second, we propose a three-coordinate semantic profile capturing novelty (displacement from generic discourse), breadth (diversity of distinct ideas), and integration (connectedness among them), together with a discrete minimal unit (the semantic quantum) whose resolution is fixed by a clustering threshold $\tau$.
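To make the three coordinates concrete, here is a minimal sketch of one way such a profile could be computed. The abstract does not give the paper's formulas, so every definition below is an assumption made for illustration: novelty is taken as one minus the cosine similarity between the text's mean embedding and a baseline "generic discourse" vector, breadth as the number of greedy clusters formed at cosine threshold `tau` (standing in for semantic quanta), and integration as the mean pairwise cosine similarity between cluster centroids. The function names and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def greedy_clusters(embeddings, tau):
    """Assumed clustering rule: join the first cluster whose seed has cosine
    similarity >= tau with the sentence, otherwise start a new cluster."""
    clusters = []  # each cluster is a list of vectors; element 0 is the seed
    for e in embeddings:
        for c in clusters:
            if cosine(e, c[0]) >= tau:
                c.append(e)
                break
        else:
            clusters.append([e])
    return clusters

def semantic_profile(embeddings, baseline, tau=0.8):
    """Return (novelty, breadth, integration) under the assumed definitions."""
    clusters = greedy_clusters(embeddings, tau)
    centroids = [mean_vector(c) for c in clusters]
    novelty = 1.0 - cosine(mean_vector(embeddings), baseline)
    breadth = len(centroids)
    pairs = [(i, j) for i in range(breadth) for j in range(i + 1, breadth)]
    # With a single cluster there are no pairs to connect; report 1.0 by convention.
    if pairs:
        integration = sum(cosine(centroids[i], centroids[j]) for i, j in pairs) / len(pairs)
    else:
        integration = 1.0
    return novelty, breadth, integration

# Toy 3-dimensional "embeddings": two near-duplicate sentences and one distinct idea.
sentences = [[1.0, 0.0, 0.0], [0.98, 0.2, 0.0], [0.0, 1.0, 0.0]]
generic = [1.0, 1.0, 1.0]  # hypothetical stand-in for the generic-discourse baseline
profile = semantic_profile(sentences, generic, tau=0.9)
```

In this toy input, raising `tau` from 0.9 to 0.99 splits the two near-duplicate sentences into separate clusters, which is the sense in which the threshold fixes the resolution of the minimal unit.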
Third, we prove a no-go theorem: no scalar summary of the profile can simultaneously satisfy analytic stability under paraphrase and concatenation, ordinal robustness across text scales, and cross-representation comparability. We exhibit two practical scalars, $S_{\mathrm{minmax}}$ and $S_{\mathrm{rank}}$, each occupying a distinct corner of this trade-off triangle. Validation across 23 synthetic categories, 5 Project Gutenberg novels, and 3 embedding models confirms the trade-off.
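The difference between a min-max and a rank normalization can be illustrated with a small sketch. The abstract does not define $S_{\mathrm{minmax}}$ or $S_{\mathrm{rank}}$, so the code below only assumes that each scalar normalizes every profile coordinate across a corpus and then averages the coordinates; the averaging step, the function names, and the toy corpus are all hypothetical.

```python
def minmax_normalize(values):
    """Rescale values linearly to [0, 1]; constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def rank_normalize(values):
    """Map values to average ranks rescaled to [0, 1]; ties share a rank."""
    n = len(values)
    if n == 1:
        return [0.0]
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return [r / (n - 1) for r in ranks]

def scalar_summary(profiles, normalize):
    """Assumed aggregation: normalize each coordinate across texts, then average."""
    columns = [normalize(list(col)) for col in zip(*profiles)]
    return [sum(row) / len(row) for row in zip(*columns)]

# Toy profiles as (novelty, breadth, integration) tuples.
corpus = [(0.10, 2, 0.30), (0.20, 4, 0.50), (0.30, 6, 0.40)]
with_outlier = corpus + [(0.35, 60, 0.55)]  # one text with extreme breadth

s_minmax = scalar_summary(corpus, minmax_normalize)
s_rank = scalar_summary(corpus, rank_normalize)
s_minmax_out = scalar_summary(with_outlier, minmax_normalize)[:3]
s_rank_out = scalar_summary(with_outlier, rank_normalize)[:3]
```

In this toy corpus the added text dominates every coordinate, so the rank-based scores of the original three texts are simply rescaled by a common factor of 2/3, while the min-max breadth values of those texts are compressed toward zero by the outlier. Rank normalization pays for that behavior by discarding magnitudes, which is one intuition for why no single scalar can meet every requirement at once.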
The recommended rank-normalized configuration passes 25 of 28 ordinal checks as point estimates (21 of 28 after Benjamini-Hochberg correction), outperforming seven baselines including unigram entropy and a BERTScore-based novelty signal. A separate variational result connects the breadth coordinate to the log-determinant of a determinantal point process (Spearman $\rho = 0.985$ over 507 Gutenberg chapters), giving an optimization-theoretic foundation for breadth.
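As background for the determinantal point process connection: in an L-ensemble, the probability of selecting a subset is proportional to the determinant of the kernel restricted to that subset, so sets of mutually dissimilar items receive a larger log-determinant. The sketch below illustrates that behavior with a cosine-similarity kernel. The kernel choice, the `ridge` term that keeps the matrix positive definite, and the function names are assumptions for illustration, not the paper's construction.

```python
import math

def gram_matrix(vectors, ridge=1e-6):
    """Cosine-similarity Gram matrix with a small ridge added to the diagonal."""
    norms = [math.sqrt(sum(a * a for a in v)) for v in vectors]
    n = len(vectors)
    return [[sum(a * b for a, b in zip(vectors[i], vectors[j])) / (norms[i] * norms[j])
             + (ridge if i == j else 0.0)
             for j in range(n)] for i in range(n)]

def log_det_psd(matrix):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    # det(M) is the squared product of the Cholesky diagonal.
    return 2.0 * sum(math.log(lower[i][i]) for i in range(n))

# Toy embeddings: three orthogonal ideas versus three near-paraphrases.
diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
redundant = [[1.0, 0.0, 0.0], [0.99, 0.1, 0.0], [0.98, 0.2, 0.0]]

logdet_diverse = log_det_psd(gram_matrix(diverse))
logdet_redundant = log_det_psd(gram_matrix(redundant))
```

The orthogonal set yields a log-determinant close to zero, while the near-paraphrase set yields a strongly negative value, consistent with reading the log-determinant as a diversity score.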