Large language models (LLMs) are often trained on extensive, temporally indiscriminate text corpora, reflecting the lack of datasets with temporal metadata. This approach is not aligned with the evolving nature of language. Conventional methods for creating temporally adapted language models often depend on further pre-training static models on time-specific data. This paper presents a new approach: a series of point-in-time LLMs called Time Machine GPT (TiMaGPT), specifically designed to be nonprognosticative. This ensures they remain uninformed about future factual information and linguistic changes. This strategy is beneficial for understanding language evolution and is of critical importance when applying models in dynamic contexts, such as time-series forecasting, where foresight of future information can prove problematic. We provide access to both the models and training datasets.
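The abstract does not describe how the point-in-time training sets are assembled, so the following is only a minimal sketch of what the nonprognosticative constraint implies for data selection. It assumes each document carries a publication date and that one snapshot is built per calendar year. The names `Document`, `point_in_time_corpus`, and `yearly_snapshots` are hypothetical and are not taken from the TiMaGPT release.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Document:
    text: str
    published: date  # assumed temporal metadata; not specified in the abstract


def point_in_time_corpus(docs, cutoff):
    """Keep only documents published on or before the cutoff date."""
    return [d for d in docs if d.published <= cutoff]


def yearly_snapshots(docs, first_year, last_year):
    """Map each year to the corpus visible at the end of that year."""
    return {
        year: point_in_time_corpus(docs, date(year, 12, 31))
        for year in range(first_year, last_year + 1)
    }


# Toy corpus with illustrative dates only.
docs = [
    Document("Report written in 2011.", date(2011, 6, 1)),
    Document("Commentary written in 2015.", date(2015, 3, 9)),
    Document("Article written in 2020.", date(2020, 11, 20)),
]
snapshots = yearly_snapshots(docs, 2011, 2020)
```

A model trained from scratch on `snapshots[2014]` would see no text dated after 2014, which is the property that matters for uses such as backtesting a forecasting pipeline. Further pre-training an existing static model on the same slice would not give this guarantee, because the base model may already have seen later text.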