Deploying LLMs raises two coupled challenges: (1) monitoring--estimating where a model underperforms as traffic and domains drift--and (2) improvement--prioritizing data acquisition to close the largest performance gaps. We test whether an inference-time signal can estimate slice-level accuracy under domain shift. For each response, we compute an output-entropy profile from final-layer next-token probabilities (recovered from top-$k$ logprobs) and summarize it with several statistics. A lightweight classifier predicts instance correctness, and averaging its predicted probabilities yields a domain-level accuracy estimate. We evaluate on ten STEM reasoning benchmarks with exhaustive train/test compositions ($k\in\{1,2,3,4\}$; all $\binom{10}{k}$ combinations), testing different classifier models and feature sets across nine LLMs from six families (3B--20B). Estimates often track held-out benchmark accuracy, and several models show near-monotonic ordering of domains, providing evidence that output-entropy profiles are an accessible signal for scalable monitoring and targeted data acquisition.
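The pipeline above can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: function names (`entropy_profile`, `summarize`, `domain_accuracy_estimate`) are hypothetical, entropies are computed over the top-$k$ probabilities renormalized to sum to one (an assumption, since full-vocabulary probabilities are unavailable from top-$k$ logprobs), and the correctness classifier is abstracted as any model exposing per-instance probabilities.

```python
import math

def entropy_profile(topk_logprobs):
    """Per-token Shannon entropy from top-k logprobs.

    topk_logprobs: list of per-token lists of log-probabilities.
    Probabilities are renormalized over the top k (assumption: the
    full distribution is not observable from the API).
    """
    profile = []
    for token_lps in topk_logprobs:
        ps = [math.exp(lp) for lp in token_lps]
        z = sum(ps)
        ps = [p / z for p in ps]  # renormalize over top-k
        profile.append(-sum(p * math.log(p) for p in ps if p > 0.0))
    return profile

def summarize(profile):
    """Fixed-length feature vector for the classifier: mean, max, std."""
    n = len(profile)
    mean = sum(profile) / n
    var = sum((h - mean) ** 2 for h in profile) / n
    return [mean, max(profile), math.sqrt(var)]

def domain_accuracy_estimate(correctness_probs):
    """Domain-level accuracy estimate: the mean of the classifier's
    predicted per-instance correctness probabilities."""
    return sum(correctness_probs) / len(correctness_probs)
```

A lightweight classifier (e.g. logistic regression) would be trained on `summarize(entropy_profile(...))` vectors with correctness labels from the training benchmarks, then applied to held-out domains and averaged via `domain_accuracy_estimate`.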