Robust incremental learning pipelines for temporal tabular datasets with distribution shifts

In this paper, we present a robust deep incremental learning model for regression tasks on financial temporal tabular datasets. Using commonly available tabular and time-series prediction models as building blocks, a machine-learning model is built incrementally to adapt to distributional shifts in the data. Using the concept of self-similarity, the model relies on only one basic building block of machine learning, the decision tree, to build models of any required complexity. The model is demonstrated to perform robustly under adverse conditions such as regime changes, fat-tailed distributions and low signal-to-noise ratios, all of which are common in financial datasets. Model robustness is studied under different hyper-parameters, such as model complexity and data sampling settings, using XGBoost models trained on the Numerai dataset as a detailed case study. A two-layer deep ensemble of XGBoost models over different model snapshots is demonstrated to deliver high-quality predictions under different market regimes. Comparing XGBoost models with different numbers of boosting rounds in three scenarios (small, standard and large), we demonstrate that model performance increases monotonically with model size and converges towards the generalisation upper bound. Our model is efficient, with much lower hardware requirements than other machine learning models, as no specialised neural architectures are used and each base model can be trained independently in parallel.
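The incremental idea in the abstract (retrain base models as new data arrives, then combine several model snapshots) can be illustrated with a small walk-forward sketch. This is not the authors' pipeline. To stay dependency-free, a one-split regression tree (decision stump) stands in for XGBoost, the snapshots are combined by a plain average rather than a trained second layer, and the names `walk_forward`, `window`, `retrain_every` and `n_snapshots` are hypothetical, as is the synthetic data with a sign flip standing in for a regime change.

```python
import random
from statistics import fmean


def fit_stump(xs, ys):
    """Fit a one-split regression tree on one feature.

    Stand-in for an XGBoost model. Returns (threshold, left_mean, right_mean).
    """
    best = None
    for t in sorted(set(xs))[1:]:  # skipping the minimum keeps both sides non-empty
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        lm, rm = fmean(left), fmean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # constant feature: fall back to the global mean
        m = fmean(ys)
        return (xs[0], m, m)
    return best[1:]


def predict_stump(model, x):
    t, lm, rm = model
    return lm if x < t else rm


def walk_forward(eras, window=4, retrain_every=2, n_snapshots=3):
    """Retrain on a trailing window of eras and predict each new era
    with the average of the most recent snapshots (assumed combiner)."""
    snapshots, preds = [], {}
    for k in range(window, len(eras)):
        if (k - window) % retrain_every == 0:
            train = eras[k - window:k]  # only eras strictly before era k
            xs = [x for era_x, _ in train for x in era_x]
            ys = [y for _, era_y in train for y in era_y]
            snapshots.append(fit_stump(xs, ys))
        recent = snapshots[-n_snapshots:]
        preds[k] = [fmean(predict_stump(m, x) for m in recent) for x in eras[k][0]]
    return snapshots, preds


def make_eras(n_eras=12, n_rows=40, seed=0):
    """Synthetic eras whose feature-target relation flips sign halfway."""
    rng = random.Random(seed)
    eras = []
    for e in range(n_eras):
        sign = 1.0 if e < n_eras // 2 else -1.0
        xs = [rng.uniform(-1, 1) for _ in range(n_rows)]
        ys = [sign * x + rng.gauss(0, 0.5) for x in xs]
        eras.append((xs, ys))
    return eras


eras = make_eras()
snapshots, preds = walk_forward(eras)
```

Each snapshot depends only on its own training window, so in a setting like the one the abstract describes the base models could be fitted independently and in parallel; the loop above fits them sequentially only for brevity.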

