A Geometric Profile of Semantic Information in Text: Frame-Conditional Uniqueness and a Trade-Off Triangle for Scalar Summaries

How much meaning does a text carry? Shannon's theory measures uncertainty over symbols and is intentionally indifferent to meaning, while pairwise metrics such as BERTScore compare two texts rather than characterizing one. We develop a geometric framework that measures semantic content from the structure of a text's sentence embeddings. The framework has three parts. First, we prove a frame-conditional uniqueness theorem: within a fixed embedding and baseline, six natural axioms determine a scalar measure uniquely up to scale. The resulting scalar is empirically too coarse, motivating a richer representation. Second, we propose a three-coordinate semantic profile capturing novelty (displacement from generic discourse), breadth (diversity of distinct ideas), and integration (connectedness among them), together with a discrete minimal unit (the semantic quantum) whose resolution is fixed by a clustering threshold $\tau$. Third, we prove a no-go theorem: no scalar summary of the profile can simultaneously satisfy analytic stability under paraphrase and concatenation, ordinal robustness across text scales, and cross-representation comparability. We exhibit two practical scalars, $S_{\mathrm{minmax}}$ and $S_{\mathrm{rank}}$, each occupying a distinct corner of this trade-off triangle. Validation across 23 synthetic categories, 5 Project Gutenberg novels, and 3 embedding models confirms the trade-off. The recommended rank-normalized configuration passes 25 of 28 ordinal checks as point estimates (21 of 28 after Benjamini-Hochberg correction), outperforming seven baselines including unigram entropy and a BERTScore-based novelty signal. A separate variational result connects the breadth coordinate to the log-determinant of a determinantal point process (Spearman $\rho = 0.985$ over 507 Gutenberg chapters), giving an optimization-theoretic foundation for breadth.
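To make the determinantal-point-process connection concrete, the following is a minimal sketch of a log-determinant diversity score over sentence embeddings. The abstract does not specify the kernel, the regularization, or the exact form of the breadth coordinate, so the cosine kernel, the `jitter` term, and the names `cosine_kernel`, `logdet_psd`, and `breadth_proxy` are all illustrative assumptions rather than the paper's definitions.

```python
import math

def cosine_kernel(vectors):
    """Gram matrix of cosine similarities (assumed kernel, not from the paper)."""
    unit = []
    for v in vectors:
        norm = math.sqrt(sum(x * x for x in v))
        unit.append([x / norm for x in v])
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]

def logdet_psd(matrix, jitter=1e-6):
    """Return log det(matrix + jitter * I) using a Cholesky factorization.

    The jitter keeps the matrix positive definite when embeddings are
    nearly collinear; its value here is an arbitrary illustrative choice.
    """
    n = len(matrix)
    chol = [[0.0] * n for _ in range(n)]
    logdet = 0.0
    for i in range(n):
        for j in range(i + 1):
            s = sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                d = matrix[i][i] + jitter - s  # squared diagonal entry
                chol[i][j] = math.sqrt(d)
                logdet += math.log(d)          # det = product of the d values
            else:
                chol[i][j] = (matrix[i][j] - s) / chol[j][j]
    return logdet

def breadth_proxy(vectors):
    """Hypothetical DPP-style diversity score: higher means more spread out."""
    return logdet_psd(cosine_kernel(vectors))

# Toy embeddings: three orthogonal "ideas" versus three near-duplicates.
diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
redundant = [[1.0, 0.0, 0.0], [1.0, 0.01, 0.0], [1.0, 0.0, 0.01]]

print(breadth_proxy(diverse), breadth_proxy(redundant))
```

Under these assumptions, mutually orthogonal embeddings give a log-determinant near zero, while near-duplicate embeddings drive it strongly negative, which matches the intuition that breadth rewards distinct ideas over repetition.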
