Large language models (LLMs) increasingly show strong performance on temporally grounded tasks, such as timeline construction, temporal question answering, and event ordering. However, it remains unclear how their behavior depends on the way time is anchored in language. In this work, we study LLMs' temporal understanding through temporal frames of reference (t-FoRs), contrasting deictic framing (past-present-future) with sequential framing (before-after). Using a large-scale dataset of real-world events from Wikidata and a similarity judgement task, we examine how LLMs' outputs vary with temporal distance, interval relations, and event duration. Our results show that LLMs systematically adapt to both t-FoRs, but the resulting similarity patterns differ significantly. Under the deictic t-FoR, similarity judgement scores form graded, asymmetric structures centered on the present, with a sharper decline for future events and higher variance for past events. Under the sequential t-FoR, similarity becomes strongly negative once events are temporally separated. Temporal judgements are also shaped by interval-algebra relations and event duration: instability is concentrated in overlap- and containment-based relations, and duration influences only past events under the deictic t-FoR. Overall, these findings characterize how LLMs organize temporal representations under different reference structures and identify the factors that most strongly shape their temporal understanding.