Large language models (LLMs) often struggle with temporal reasoning, which is crucial for tasks such as historical event analysis and time-sensitive information retrieval. Despite recent advances, state-of-the-art models still falter when handling temporal information, especially when given irrelevant or noisy contexts. This paper addresses this gap by empirically examining the robustness of temporal question-answering (TQA) systems trained on various context types: relevant, irrelevant, slightly altered, and no context. Our findings indicate that training on a mix of these context types improves model robustness and accuracy. We also show that the position of the context relative to the question significantly affects performance, with question-first positioning yielding better results. We introduce two new context-rich TQA datasets, ContextAQA and ContextTQE, and provide comprehensive evaluations and guidelines for training robust TQA models. Our work lays the foundation for developing reliable, context-aware temporal QA systems, with broader implications for improving LLM robustness to diverse and potentially adversarial information.
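As a rough illustration of the training setup summarized above, the following sketch assigns each question one of the four context types at random and formats the input with the question placed before the context. It is a minimal sketch under stated assumptions, not the authors' pipeline: the record fields (`question`, `answer`, `relevant`, `irrelevant`, `altered`), the prompt template, the uniform sampling, and the helper names `build_prompt` and `make_mixed_examples` are all hypothetical, and the toy record is illustrative only.

```python
import random

# The four context conditions named in the abstract.
CONTEXT_TYPES = ("relevant", "irrelevant", "altered", "none")


def build_prompt(question, context, question_first=True):
    """Format one TQA input. The template is an assumption, not the paper's."""
    q = f"Question: {question}"
    if not context:
        # "No context" condition: the model sees only the question.
        return q
    c = f"Context: {context}"
    # Question-first places the question before the context.
    return f"{q}\n{c}" if question_first else f"{c}\n{q}"


def make_mixed_examples(records, seed=0, question_first=True):
    """Pair each record with one uniformly sampled context type."""
    rng = random.Random(seed)  # seeded so the mix is reproducible
    examples = []
    for rec in records:
        ctx_type = rng.choice(CONTEXT_TYPES)
        context = None if ctx_type == "none" else rec[ctx_type]
        examples.append({
            "prompt": build_prompt(rec["question"], context, question_first),
            "answer": rec["answer"],
            "context_type": ctx_type,
        })
    return examples


# Toy record (hypothetical, not drawn from ContextAQA or ContextTQE).
records = [{
    "question": "In which year did Apollo 11 land on the Moon?",
    "answer": "1969",
    "relevant": "Apollo 11 landed on the Moon in July 1969.",
    "irrelevant": "The Eiffel Tower was completed in 1889.",
    "altered": "Apollo 11 landed on the Moon in July 1968.",
}]

if __name__ == "__main__":
    for example in make_mixed_examples(records * 4, seed=0):
        print(example["context_type"])
        print(example["prompt"])
        print()
```

Setting `question_first=False` yields the context-first ordering, so the two positions can be compared on otherwise identical inputs.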