Recent studies evaluate the value orientation of large language models (LLMs) using adapted social surveys, typically by prompting models with survey questions and comparing their responses to average human responses. This paper identifies limitations of this methodology that, depending on the exact setup, can lead to both underestimating and overestimating the similarity of value orientations. Using the World Values Survey in three languages across five countries, we demonstrate that prompting methods (direct vs. chain-of-thought) and decoding strategies (greedy vs. sampling) significantly affect results. To assess the interaction between answers, we introduce a novel metric, self-correlation distance, which measures whether LLMs maintain consistent relationships between answers to different questions, as humans do. Our results show that even high average agreement with human data, when LLM responses are considered independently, does not guarantee structural alignment of responses. Additionally, we reveal a weak correlation between two common evaluation metrics, mean-squared distance and KL divergence, both of which assume that survey answers are independent of each other. For future research, we recommend chain-of-thought (CoT) prompting, sampling-based decoding with dozens of samples, and robust analysis using multiple metrics, including self-correlation distance.
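The three metrics named above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the definitions of `mean_squared_distance` and `kl_divergence` follow their standard textbook forms, and `self_correlation_distance` is a hypothetical reading of the abstract's description, assuming it compares inter-question correlation matrices of model and human responses.

```python
import numpy as np

def mean_squared_distance(model_means, human_means):
    """Mean-squared distance between per-question average answers."""
    diff = np.asarray(model_means, dtype=float) - np.asarray(human_means, dtype=float)
    return float(np.mean(diff ** 2))

def kl_divergence(p, q, eps=1e-12):
    """KL divergence D(p || q) between two answer-option distributions
    for a single question (smoothed to avoid division by zero)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def self_correlation_distance(model_answers, human_answers):
    """Hypothetical sketch of the paper's novel metric: average absolute
    difference between the inter-question correlation matrices of model
    and human response matrices (shape: respondents x questions)."""
    c_model = np.corrcoef(np.asarray(model_answers, dtype=float), rowvar=False)
    c_human = np.corrcoef(np.asarray(human_answers, dtype=float), rowvar=False)
    return float(np.mean(np.abs(c_model - c_human)))
```

Note that the first two metrics treat each question independently, whereas the correlation-based sketch captures relationships *between* questions, which is the structural property the abstract argues the independent metrics miss.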