Temporal reasoning (TR) is a critical component of artificial intelligence, encompassing the understanding and processing of temporal information and the relationships between events. To probe the TR ability of Large Language Models (LLMs), various datasets have been constructed in different ways to evaluate different aspects of TR. We propose a novel pipeline for constructing datasets that evaluate the TR ability of LLMs, leveraging random directed graph generation, Linear Temporal Logic (LTL) formulas, and the NuSMV model checker. Using this pipeline, we construct LTLBench, a benchmark of 2,000 TR challenges, and evaluate six LLMs on it. We further conduct experiments on how increasing the number of events and formula operators affects the complexity of TR problems and the performance of LLMs. We demonstrate that although LLMs show some promise on TR challenges, they still struggle with complex TR. We hope this work offers insight into the TR ability of LLMs while providing a valuable tool for future TR evaluations.
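To make the pipeline concrete, the sketch below shows one plausible way its three stages could fit together: generate a random directed graph whose nodes stand for events, grow a random LTL formula over those events, and render both as a NuSMV model with an `LTLSPEC` to be checked. This is a minimal illustration under assumed conventions (event names `e0..eN`, propositions of the form `state = eK`), not the authors' actual implementation.

```python
import random

UNARY = ["G", "F", "X"]   # globally / finally / next
BINARY = ["U", "&", "|"]  # until / and / or

def random_event_graph(n_events, n_edges, seed=0):
    """Random directed graph: nodes are events, edges are allowed transitions."""
    rng = random.Random(seed)
    events = [f"e{i}" for i in range(n_events)]
    edges = set()
    while len(edges) < n_edges:
        edges.add(tuple(rng.sample(events, 2)))  # two distinct events per edge
    return events, sorted(edges)

def random_ltl_formula(events, n_ops, seed=0):
    """Grow a random LTL formula over 'state = event' propositions."""
    rng = random.Random(seed)
    prop = lambda: f"state = {rng.choice(events)}"
    formula = prop()
    for _ in range(n_ops):  # each step wraps the formula in one more operator
        if rng.random() < 0.5:
            formula = f"{rng.choice(UNARY)} ({formula})"
        else:
            formula = f"({formula}) {rng.choice(BINARY)} ({prop()})"
    return formula

def to_smv(events, edges, formula):
    """Render the graph as a NuSMV transition system plus an LTLSPEC."""
    # Successors per event; events with no outgoing edge self-loop.
    succ = {s: [d for a, d in edges if a == s] or [s] for s in events}
    lines = [
        "MODULE main",
        f"VAR state : {{{', '.join(events)}}};",
        "ASSIGN",
        f"  init(state) := {events[0]};",
        "  next(state) := case",
    ]
    for s in events:
        lines.append(f"    state = {s} : {{{', '.join(succ[s])}}};")
    lines += ["  esac;", f"LTLSPEC {formula}"]
    return "\n".join(lines)

events, edges = random_event_graph(4, 5, seed=42)
formula = random_ltl_formula(events, 3, seed=42)
print(to_smv(events, edges, formula))
```

The emitted model can then be handed to NuSMV, whose verdict (and counterexample trace) would supply the ground-truth label for a natural-language TR challenge derived from the same graph and formula.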