From Confidence to Collapse in LLM Factual Robustness

Ensuring the robustness of factual knowledge in LLMs is critical for reliable applications in tasks such as question answering and reasoning. However, existing evaluation methods focus predominantly on performance-based metrics and typically study robustness through prompt perturbations, which captures only the externally triggered side of knowledge robustness. To bridge this gap, we introduce a principled approach that measures factual robustness from the perspective of the generation process, by analyzing token distribution entropy in combination with temperature scaling sensitivity. Together, these two factors define the Factual Robustness Score (FRS), a novel metric that quantifies the stability of a fact against perturbations in decoding conditions, given its initial uncertainty. To validate our approach, we conduct extensive experiments on 5 LLMs across 3 closed-book QA datasets (SQuAD, TriviaQA, and HotpotQA). We show that factual robustness varies significantly -- smaller models report an FRS of $0.76$, larger ones $0.93$ -- with accuracy degrading by roughly $60\%$ under increased uncertainty. These insights demonstrate how entropy and temperature scaling impact factual accuracy, and lay a foundation for developing more robust knowledge retention and retrieval in future models.
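
The two ingredients named above, token distribution entropy and temperature scaling, can be illustrated with a minimal sketch. This is not the FRS formula, which the abstract does not state; it only shows how dividing logits by a temperature reshapes a next-token distribution and changes its Shannon entropy. The function names and the logit values are hypothetical and chosen for illustration only.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw logits into probabilities after dividing by a temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def shannon_entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical logits for four candidate answer tokens (not from the paper).
logits = [4.0, 1.0, 0.5, 0.0]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: top-token prob={max(probs):.3f}, entropy={shannon_entropy(probs):.3f}")
```

In this toy setting, raising the temperature flattens the distribution: the probability of the top token falls and the entropy rises, which is the kind of decoding-condition perturbation the abstract refers to.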
