In this study, we investigate the counterfactual reasoning capabilities of large language models (LLMs). Our primary objective is to elicit counterfactual thought processes in LLMs and rigorously assess their validity. Specifically, we introduce a novel task, Counterfactual Logical Modification (CLOMO), together with a high-quality human-annotated benchmark. In this task, an LLM must modify a given argumentative text so that a predetermined logical relationship still holds. To effectively evaluate a generation model's counterfactual capabilities, we propose a novel evaluation metric, the decomposed Self-Evaluation Score (SES), which directly evaluates the natural language output of LLMs rather than casting the task as a multiple-choice problem. Our analysis shows that the proposed automatic metric aligns well with human preference. Experimental results show that while LLMs demonstrate a notable capacity for logical counterfactual thinking, a discernible gap remains between their current abilities and human performance. Code and data are available at https://github.com/Eleanor-H/CLOMO.