Large language model (LLM) evaluation platforms increasingly rely on pairwise human judgments. These data are noisy, sparse, and non-uniform, yet leaderboards are reported with little uncertainty quantification. We study this problem as semiparametric inference for a low-rank latent score tensor observed through pairwise comparisons under Bradley-Terry-Luce-type models. This places LLM evaluation in a new tensor completion setting with structured observations, non-uniform sampling, and pairwise contrasts. Our target is a smooth functional $\psi(T^\star)$, including linear estimands such as ability gaps and nonlinear ones such as win probabilities. We derive the information operator on the low-rank tangent space, the efficient influence function, and the semiparametric efficiency bound, then construct a one-step debiased estimator with asymptotic normality. A central challenge is that the information operator is anisotropic and does not commute with the tangent-space projection, creating a bottleneck absent from isotropic models. We introduce a score-whitening method that equalizes local Fisher information and restores stable inference at the optimal sample-complexity scale. Our results provide a principled framework for uncertainty quantification in LLM evaluation and, more broadly, for inference on low-rank structures from pairwise data.
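The Bradley-Terry-Luce observation model underlying the abstract can be illustrated with a minimal sketch. This is not the paper's estimator: it only simulates pairwise comparisons from latent scores, fits them by maximum likelihood, and evaluates a plug-in win probability as an example of a nonlinear functional; the scores, sample size, and step size are all hypothetical choices for illustration.

```python
import numpy as np

def win_prob(theta_i, theta_j):
    # Bradley-Terry: P(item i beats item j) = sigmoid(theta_i - theta_j)
    return 1.0 / (1.0 + np.exp(-(theta_i - theta_j)))

rng = np.random.default_rng(0)
theta_true = np.array([1.0, 0.3, -0.5, -0.8])  # hypothetical latent abilities (mean zero)
n_items = len(theta_true)
n_obs = 5000

# Simulate pairwise comparisons between randomly drawn item pairs.
idx = np.array([rng.choice(n_items, size=2, replace=False) for _ in range(n_obs)])
i_idx, j_idx = idx[:, 0], idx[:, 1]
y = (rng.random(n_obs) < win_prob(theta_true[i_idx], theta_true[j_idx])).astype(float)

# Fit latent scores by gradient ascent on the BTL log-likelihood.
theta = np.zeros(n_items)
for _ in range(200):
    resid = y - win_prob(theta[i_idx], theta[j_idx])
    grad = (np.bincount(i_idx, weights=resid, minlength=n_items)
            - np.bincount(j_idx, weights=resid, minlength=n_items))
    theta += 2.0 * grad / n_obs
    theta -= theta.mean()  # identifiability: scores are only defined up to a shift

# Example functionals: a linear ability gap and a nonlinear win probability.
gap_01 = theta[0] - theta[1]
psi_01 = win_prob(theta[0], theta[1])
```

The plug-in estimates `gap_01` and `psi_01` correspond to the linear and nonlinear estimands mentioned in the abstract; the paper's contribution is a debiased version of such plug-ins with valid confidence intervals, which this sketch does not implement.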