Many real-world applications of tabular data use historical events to predict properties of new ones, for example whether a credit card transaction is fraudulent or what rating a customer will assign a product on a retail platform. Existing approaches to event prediction rely on costly, brittle, and application-dependent techniques such as time-aware positional embeddings, learned row and field encodings, and oversampling methods for addressing class imbalance. Moreover, these approaches often assume specific use cases, for example that the labels of all historical events are known or that only a pre-specified label is predicted and not the data's features themselves. In this work, we propose a simple but flexible baseline using standard autoregressive LLM-style transformers with elementary positional embeddings and a causal language modeling objective. Our baseline outperforms existing approaches across popular datasets and can be employed for various use cases. We demonstrate that the same model can predict labels, impute missing values, or model event sequences.
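To make the causal language modeling setup concrete, the following is a minimal sketch of how a history of tabular events might be flattened into one token sequence and turned into next-token training pairs. The abstract does not specify a tokenization, so the `field=value` tokens, the `<row>` separator, the toy fraud data, and all function names here are illustrative assumptions, not the paper's implementation.

```python
from typing import Dict, List, Tuple

ROW_SEP = "<row>"  # hypothetical separator token marking the end of an event


def serialize_events(events: List[Dict[str, str]], fields: List[str]) -> List[str]:
    """Flatten chronologically ordered events into a single token sequence.

    Assumption: each cell becomes one "field=value" token, and the label
    field is listed last so it is predicted after the event's features.
    """
    tokens: List[str] = []
    for event in events:
        for field in fields:
            tokens.append(f"{field}={event[field]}")
        tokens.append(ROW_SEP)
    return tokens


def build_vocab(tokens: List[str]) -> Dict[str, int]:
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab: Dict[str, int] = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def causal_pairs(token_ids: List[int]) -> List[Tuple[List[int], int]]:
    """Build (context, target) pairs: every prefix predicts the token after it."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]


# Toy, made-up history of two transactions (values already discretized).
events = [
    {"amount": "low", "merchant": "grocery", "is_fraud": "0"},
    {"amount": "high", "merchant": "electronics", "is_fraud": "1"},
]
fields = ["amount", "merchant", "is_fraud"]

tokens = serialize_events(events, fields)
vocab = build_vocab(tokens)
token_ids = [vocab[t] for t in tokens]
pairs = causal_pairs(token_ids)

# The pair whose target is the newest event's label: the context holds the
# full previous event plus the new event's features.
label_context, label_target = pairs[5]
```

Under this layout, predicting a label, imputing a missing field, and modeling the next event are all the same next-token prediction, read off at different positions in the sequence. A real model would feed `label_context` through a transformer with a causal attention mask and train with cross-entropy against `label_target`.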