Large language models (LLMs) rarely admit uncertainty, often producing fluent but misleading answers rather than abstaining (i.e., refusing to answer). This weakness is especially evident in temporal question answering, where models frequently ignore time-sensitive evidence and conflate facts across different time periods. In this paper, we present the first empirical study of training LLMs with an abstention ability while reasoning about temporal QA. Existing approaches such as calibration can be unreliable in capturing uncertainty in complex reasoning. We instead frame abstention as a teachable skill and introduce a pipeline that couples Chain-of-Thought (CoT) supervision with Reinforcement Learning (RL) guided by abstention-aware rewards. Our goal is to systematically analyze how different information types and training techniques affect temporal reasoning with abstention behavior in LLMs. Through extensive experiments across these methods, we find that RL yields strong empirical gains on reasoning: a model initialized from Qwen2.5-1.5B-Instruct surpasses GPT-4o by $3.46\%$ and $5.80\%$ in Exact Match on TimeQA-Easy and TimeQA-Hard, respectively. Moreover, it improves the True Positive rate on unanswerable questions by $20\%$ over a purely supervised fine-tuned (SFT) variant. Beyond performance, our analysis shows that SFT induces overconfidence and harms reliability, while RL improves prediction accuracy but exhibits similar risks. Finally, by comparing implicit reasoning cues (e.g., original context, temporal sub-context, knowledge graphs) with explicit CoT supervision, we find that implicit information provides limited benefit for reasoning with abstention. Our study offers new insights into how abstention and reasoning can be jointly optimized, laying a foundation for building more reliable LLMs.