Predicting upcoming words is a core mechanism of language comprehension and may be quantified using Shannon entropy. There is currently no empirical consensus on how many human responses are required to obtain stable and unbiased entropy estimates at the word level. Moreover, large language models (LLMs) are increasingly used as substitutes for human norming data, yet their ability to reproduce stable human entropy estimates remains unclear. Here, we address both issues using two large publicly available cloze datasets in German [1] and English [2]. We implemented a bootstrap-based convergence analysis that tracks how entropy estimates stabilize as a function of sample size. Across both languages, more than 97% of sentences reached stable entropy estimates within the available sample sizes. 90% of sentences converged within 111 responses in German and 81 responses in English; low-entropy sentences (entropy < 1) required as few as 20 responses, whereas high-entropy sentences (entropy > 2.5) required substantially more. These findings provide the first direct empirical validation of common norming practices and demonstrate that convergence depends critically on sentence predictability. We then compared stable human entropy values with entropy estimates derived from several LLMs: GPT-4o, evaluated with both logit-based probability extraction and sampling-based frequency estimation; GPT2-xl/german-GPT-2; RoBERTa Base/GottBERT; and LLaMA 2 7B Chat. GPT-4o showed the highest correspondence with human data, although alignment depended strongly on the extraction method and prompt design. Logit-based estimates minimized absolute error, whereas sampling-based estimates better captured the dispersion of human variability. Together, our results establish practical guidelines for human norming and show that while LLMs can approximate human entropy, they are not interchangeable with stable human-derived distributions.
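The bootstrap-based convergence analysis described above can be illustrated with a minimal sketch. All function names, the resampling scheme, and the convergence criterion (bootstrap-mean entropy within a fixed tolerance of the full-sample estimate) are illustrative assumptions, not the authors' exact procedure.

```python
# Illustrative sketch of a bootstrap convergence analysis for cloze entropy.
# The tolerance-based criterion below is an assumption for demonstration,
# not the criterion used in the paper.
import math
import random
from collections import Counter

def shannon_entropy(responses):
    """Shannon entropy (in bits) of the empirical response distribution."""
    counts = Counter(responses)
    n = len(responses)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def bootstrap_entropy(responses, sample_size, n_boot=1000, rng=None):
    """Mean entropy across bootstrap resamples of a given size."""
    rng = rng or random.Random(0)
    estimates = [
        shannon_entropy(rng.choices(responses, k=sample_size))
        for _ in range(n_boot)
    ]
    return sum(estimates) / n_boot

def convergence_point(responses, tolerance=0.05, step=10):
    """Smallest sample size (in steps of `step`) whose bootstrap-mean
    entropy lies within `tolerance` bits of the full-sample estimate."""
    full = shannon_entropy(responses)
    for k in range(step, len(responses) + 1, step):
        if abs(bootstrap_entropy(responses, k) - full) <= tolerance:
            return k
    return None

# Hypothetical low-entropy cloze item: one dominant completion.
responses = ["coffee"] * 80 + ["tea"] * 15 + ["milk"] * 5
print(convergence_point(responses))
```

Because the plug-in entropy estimator is negatively biased at small sample sizes, low-entropy items with few distinct completions stabilize after far fewer responses than high-entropy items, mirroring the pattern reported above.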