Representing continuous time is a critical and under-explored challenge in modeling temporal event sequences with large language models (LLMs). Strategies such as byte-level representations and calendar tokens have been proposed, but the best approach remains unclear, especially because real-world event data follow diverse statistical distributions, ranging from smooth log-normal to discrete, spiky patterns. This paper presents a systematic empirical study of temporal tokenization for modeling event sequences with LLMs, comparing five encoding strategies: naive numeric strings, high-precision byte-level representations, human-semantic calendar tokens, classic uniform binning, and adaptive residual scalar quantization. We evaluate these strategies by fine-tuning LLMs on real-world datasets that exemplify these diverse distributions. Our analysis reveals that no single strategy is universally superior; instead, prediction performance depends heavily on aligning the tokenizer with the data's statistical properties. This highlights temporal tokenization as a critical yet often overlooked design dimension in LLM-based event modeling.
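To make the five encoding strategies concrete, the following is a minimal sketch of how a single inter-event time (in seconds) might be turned into tokens under each one. It is not the paper's implementation: the function names, the token spellings (`<bin_5>`, `<byte_3f>`, and so on), the float32 byte layout, and the use of 1-D k-means to fit the residual codebooks are all assumptions made for illustration.

```python
import struct


def numeric_string_tokens(dt, decimals=2):
    """Naive numeric string: one token per character of the decimal text."""
    return list(f"{dt:.{decimals}f}")


def byte_tokens(dt):
    """Byte-level: the four bytes of a big-endian float32, one token per byte."""
    return [f"<byte_{b:02x}>" for b in struct.pack(">f", dt)]


def calendar_tokens(dt_seconds):
    """Calendar-style: split an interval into day/hour/minute/second tokens."""
    total = int(round(dt_seconds))
    days, rem = divmod(total, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    return [f"<day_{days}>", f"<hour_{hours}>", f"<min_{minutes}>", f"<sec_{seconds}>"]


def uniform_bin_token(dt, lo, hi, n_bins):
    """Uniform binning: equal-width bins on [lo, hi]; out-of-range values are clamped."""
    width = (hi - lo) / n_bins
    idx = int((dt - lo) // width)
    idx = min(max(idx, 0), n_bins - 1)
    return f"<bin_{idx}>"


def fit_codebook(values, k, iters=20):
    """Assumed codebook fit: 1-D k-means (Lloyd's algorithm), quantile-initialised."""
    ordered = sorted(values)
    centers = [ordered[int((i + 0.5) * len(ordered) / k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            j = min(range(k), key=lambda c: abs(v - centers[c]))
            groups[j].append(v)
        # Empty groups keep their previous center.
        centers = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
    return centers


def fit_rsq(values, k, n_stages):
    """Residual scalar quantization: each stage's codebook is fit on the
    residuals left over by the previous stage, so it adapts to the data."""
    codebooks, residuals = [], list(values)
    for _ in range(n_stages):
        cb = fit_codebook(residuals, k)
        codebooks.append(cb)
        residuals = [r - min(cb, key=lambda c: abs(r - c)) for r in residuals]
    return codebooks


def rsq_encode(x, codebooks):
    """Return one codebook index per stage (coarse first, then refinements)."""
    indices, residual = [], x
    for cb in codebooks:
        j = min(range(len(cb)), key=lambda c: abs(residual - cb[c]))
        indices.append(j)
        residual -= cb[j]
    return indices


def rsq_decode(indices, codebooks):
    """Reconstruct the value as the sum of the selected code values."""
    return sum(cb[j] for j, cb in zip(indices, codebooks))
```

In this sketch, `calendar_tokens(93784)` yields `<day_1>`, `<hour_2>`, `<min_3>`, `<sec_4>`, while `uniform_bin_token(5.0, 0.0, 10.0, 10)` yields `<bin_5>`. The contrast between the last two strategies is the one the abstract points at: uniform bins fix their boundaries in advance, whereas the residual codebooks are fit to the data, so their resolution follows wherever the observed intervals concentrate.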