This paper presents the first study of temporal relation extraction in a zero-shot setting, focusing on biomedical text. We employ two types of prompts and five LLMs (GPT-3.5, Mixtral, Llama 2, Gemma, and PMC-LLaMA) to obtain responses about the temporal relation between two events. Our experiments demonstrate that LLMs struggle in the zero-shot setting, performing worse than fine-tuned specialized models in terms of F1 score, which shows that this is a challenging task for LLMs. We further contribute a novel, comprehensive temporal analysis by calculating consistency scores for each LLM. Our findings reveal that LLMs face challenges in providing responses that are consistent with the temporal properties of uniqueness and transitivity. Moreover, we study the relation between the temporal consistency of an LLM and its accuracy, and whether the latter can be improved by resolving temporal inconsistencies. Our analysis shows that even when temporal consistency is achieved, the predictions can remain inaccurate.
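To make the two temporal properties concrete, the following is a minimal sketch of how uniqueness and transitivity consistency could be scored over a set of model responses. It is an illustration under stated assumptions, not the scoring procedure used in this paper: the label set (`BEFORE`, `AFTER`, `OVERLAP`), the function names `uniqueness_score` and `transitivity_score`, the input layout, and the example events are all hypothetical.

```python
from itertools import permutations


def uniqueness_score(answers):
    """Fraction of event pairs for which exactly one relation was affirmed.

    `answers` maps an ordered event pair to the set of relations the model
    affirmed for it (assumed layout, e.g. one yes/no prompt per relation).
    """
    if not answers:
        return 0.0
    unique = sum(1 for rels in answers.values() if len(rels) == 1)
    return unique / len(answers)


def transitivity_score(labels, relation="BEFORE"):
    """Fraction of chains a->b, b->c with `relation` where a->c agrees.

    `labels` maps an ordered event pair to a single predicted relation.
    Only chains whose closing pair (a, c) was actually predicted are
    counted. Returns None when there is no such chain to check.
    """
    events = {event for pair in labels for event in pair}
    checked = 0
    satisfied = 0
    for a, b, c in permutations(events, 3):
        if labels.get((a, b)) == relation and labels.get((b, c)) == relation:
            if (a, c) in labels:
                checked += 1
                if labels[(a, c)] == relation:
                    satisfied += 1
    return satisfied / checked if checked else None


# Hypothetical responses for three clinical events.
answers = {
    ("admission", "surgery"): {"BEFORE"},
    ("surgery", "discharge"): {"BEFORE", "OVERLAP"},  # violates uniqueness
    ("admission", "discharge"): {"AFTER"},
}
labels = {
    ("admission", "surgery"): "BEFORE",
    ("surgery", "discharge"): "BEFORE",
    ("admission", "discharge"): "AFTER",  # violates transitivity
}

print(uniqueness_score(answers))
print(transitivity_score(labels))
```

In this toy input, one of the three pairs receives two relations at once, and the single checkable chain is closed with a contradictory label. Note that a set of predictions can satisfy both checks and still disagree with the gold annotations, which matches the observation above that consistency does not guarantee accuracy.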