Large language models have demonstrated impressive performance on commonsense tasks; however, these tasks are often posed as multiple-choice questions, allowing models to exploit systematic biases. Common sense is also inherently probabilistic, with multiple correct answers. The purpose of "boiling water" could be making tea or cooking, but it could also be killing germs. Existing tasks do not capture this probabilistic nature of common sense. To this end, we present commonsense frame completion (CFC), a new generative task that evaluates common sense via multiple open-ended generations. We also propose a method of probabilistic evaluation that strongly correlates with human judgments. Humans drastically outperform strong language model baselines on our dataset, indicating that this approach is both a challenging and a useful evaluation of machine common sense.