This paper systematically compares methods of deriving item-level predictions from language models for multiple-choice tasks. We compare methods of scoring answer options based on free generation of responses, various probability-based scores, Likert-scale-style ratings, and embedding similarity. In a case study on pragmatic language interpretation, we find that LLM predictions are not robust to the choice of method, either within a single LLM or across different LLMs. Because this variability gives researchers considerable degrees of freedom in reporting results, awareness of it is crucial for ensuring the robustness of results and research integrity.
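To make the method dependence concrete, the following minimal sketch contrasts two common probability-based scores for answer options: the summed token log-probability and its length-normalized mean. The abstract does not specify which scores the paper uses, so both scoring functions, all names, and the per-token log-probabilities here are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical per-token log-probabilities a model might assign to each
# answer option. These numbers are made up for illustration, not taken
# from the paper or from any real model.
option_token_logprobs = {
    "A": [-0.9, -1.1],              # short option, moderately likely tokens
    "B": [-0.5, -0.6, -0.5, -0.6],  # longer option, each token fairly likely
    "C": [-2.5],                    # single unlikely token
}


def sum_logprob(token_logprobs):
    """Log-probability of the whole option string (favors short options)."""
    return sum(token_logprobs)


def mean_logprob(token_logprobs):
    """Length-normalized score: average log-probability per token."""
    return sum(token_logprobs) / len(token_logprobs)


def score_options(options, scorer):
    """Apply one scoring function to every answer option."""
    return {name: scorer(lps) for name, lps in options.items()}


def softmax(scores):
    """Turn raw option scores into a distribution over the options."""
    top = max(scores.values())
    exps = {name: math.exp(s - top) for name, s in scores.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


def predict(scores):
    """Item-level prediction: the option with the highest score."""
    return max(scores, key=scores.get)


sum_scores = score_options(option_token_logprobs, sum_logprob)
mean_scores = score_options(option_token_logprobs, mean_logprob)

prediction_by_sum = predict(sum_scores)
prediction_by_mean = predict(mean_scores)
distribution_by_sum = softmax(sum_scores)

print("sum of log-probs picks: ", prediction_by_sum)
print("mean of log-probs picks:", prediction_by_mean)
```

With these made-up inputs, the two scores select different options for the same item, which is the kind of method-induced disagreement the abstract describes. In practice the log-probabilities would come from a model's output, and the other method families (free generation, rating, embedding similarity) would need their own scoring steps.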