We present a systematic stress test of geometric metrics for LLM evaluation. Rank-based geometric properties of internal representations have shown promise as reference-free quality signals, but the conditions under which they are reliable remain unclear. We evaluate eight commonly used metrics (intrinsic-dimensionality estimators, spectral norms, and related quantities) across six tester models (0.5B-8B parameters) and eight generators on contrasting tasks, separating genuine geometric signal both from text-length effects and from what standard text statistics already capture. Three findings emerge. First, some metrics (notably Schatten Norm and MOM) mainly reflect output length, and their apparent discriminative power collapses once length is controlled. Second, geometric metrics add modest but real information beyond text statistics: a classifier that combines the two reaches 78% accuracy on 6-way generator identification, versus 69% for text statistics alone. Third, rather than tracking a general notion of text quality, the metrics show only a moderate association between intrinsic dimensionality and lexical diversity (RTTR). We give use-case-specific recommendations and identify failure detection as the most promising near-term application.