Large Language Models (LLMs) are increasingly used in tasks requiring interpretive and inferential accuracy. In this paper, we introduce ExpliCa, a new dataset for evaluating LLMs in explicit causal reasoning. ExpliCa uniquely integrates both causal and temporal relations, presented in different linguistic orders and explicitly expressed by linguistic connectives. The dataset is enriched with crowdsourced human acceptability ratings. We tested seven commercial and open-source LLMs on ExpliCa through prompting and perplexity-based metrics, finding that even top models struggle to reach 0.80 accuracy. Interestingly, models tend to confound temporal relations with causal ones, and their performance is also strongly influenced by the linguistic order of the events. Finally, perplexity-based scores and prompting performance are differently affected by model size.
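To make the perplexity-based evaluation concrete, the following is a minimal sketch, not the paper's actual evaluation code, of how such a metric can work: score candidate connectives by the perplexity a language model assigns to the sentence containing each one, and treat the lowest-perplexity option as the model's choice. The model name and the example sentences are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any causal LM can stand in here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def perplexity(text: str) -> float:
    """Perplexity = exp(mean negative log-likelihood of the tokens)."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy loss.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

# Hypothetical ExpliCa-style item: same two events, different connectives.
candidates = {
    "causal":   "The road was icy, so the car skidded.",
    "temporal": "The road was icy, then the car skidded.",
}
scores = {relation: perplexity(s) for relation, s in candidates.items()}
print(scores)  # the relation whose connective yields lower perplexity "wins"
```

Under this scheme, no prompting is involved: the model's preference is read off directly from its token probabilities, which is why perplexity-based scores and prompting accuracy can diverge as model size changes.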