Large language models (LLMs) have achieved impressive results across a broad range of tasks. With instruction fine-tuning, LLMs have also been shown to generalize in zero-shot settings. However, whether LLMs closely align with the human disagreement distribution has not been well studied, especially within the scope of natural language inference (NLI). In this paper, we evaluate the performance of LLMs and the alignment of their output distributions with human label distributions, using two different techniques to estimate the multinomial distribution: Monte Carlo Estimation (MCE) and Log Probability Estimation (LPE). Our results show that LLMs exhibit limited ability in solving NLI tasks and simultaneously fail to capture the human disagreement distribution. Inference performance and human alignment drop even further on data samples with high levels of human disagreement, raising concerns about their natural language understanding (NLU) ability and their representativeness of a larger human population. The source code for the experiments is available at https://github.com/xfactlab/emnlp2023-LLM-Disagreement
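To make the two estimation techniques concrete, the following is a minimal sketch of how a three-way NLI label distribution might be estimated from an LLM and compared against human annotations. It is an illustration under stated assumptions, not the implementation from the repository above: the function names, the fixed label set, and the use of Jensen-Shannon divergence as the comparison measure are all assumptions introduced here. MCE is taken to mean normalizing the counts of labels sampled from repeated generations, and LPE to mean normalizing the log probabilities the model assigns to each label.

```python
import math
from collections import Counter

# Assumed three-way NLI label set (illustrative; not taken from the paper's code).
LABELS = ("entailment", "neutral", "contradiction")


def monte_carlo_estimate(sampled_labels):
    """Estimate the label distribution from labels sampled over repeated generations."""
    counts = Counter(sampled_labels)
    total = sum(counts[label] for label in LABELS)
    if total == 0:
        raise ValueError("no valid labels among the samples")
    return [counts[label] / total for label in LABELS]


def log_prob_estimate(label_logprobs):
    """Estimate the label distribution by normalizing per-label log probabilities."""
    scores = [label_logprobs[label] for label in LABELS]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    normalizer = sum(exps)
    return [value / normalizer for value in exps]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two discrete distributions."""
    mixture = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)
```

In use, `monte_carlo_estimate` would receive the labels parsed from repeated sampled generations for one premise-hypothesis pair, `log_prob_estimate` would receive a mapping from each label to its log probability under the model, and either result could be passed to `js_divergence` together with the normalized human annotation counts for that pair.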