We present a comparative analysis of transformer-based models for tabular data on an industry-scale dataset. Earlier studies reported promising results on smaller public or synthetic datasets, but that effectiveness did not extend to larger industry-scale datasets. The challenges identified include handling high-dimensional data, the need for efficient pre-processing of categorical and numerical features, and substantial computational requirements. To address these challenges, we conduct an extensive examination of various transformer-based models using both synthetic datasets and the American Express default prediction Kaggle dataset (2022). We present insights into optimal data pre-processing, compare pre-training with direct supervised learning, discuss strategies for managing categorical and numerical features, and highlight trade-offs between computational resources and performance. Focusing on temporal financial data modeling, this work aims to facilitate the systematic development and deployment of transformer-based models in real-world scenarios, with an emphasis on scalability.