This paper quantitatively evaluates the performance of ChatGPT, an interactive large language model, on inter-sentential relations such as temporal relations, causal relations, and discourse relations. Given ChatGPT's promising performance across various tasks, we conduct thorough evaluations on the full test sets of 11 datasets covering temporal and causal relations, PDTB2.0-based discourse relations, and dialogue-based discourse relations. To ensure the reliability of our findings, we employ three tailored prompt templates for each task: a zero-shot prompt template, a zero-shot prompt engineering (PE) template, and an in-context learning (ICL) prompt template. With these, we establish, for the first time, initial baseline scores for all popular sentence-pair relation classification tasks. We find that ChatGPT exhibits exceptional proficiency in detecting and reasoning about causal relations, although it may not possess the same level of expertise in identifying the temporal order between two events. While it can identify the majority of discourse relations when explicit discourse connectives are present, implicit discourse relations remain a formidable challenge. ChatGPT also performs poorly on the dialogue discourse parsing task, which requires understanding the structure of a dialogue before identifying the discourse relations within it.