In the age of big data and interpretable machine learning, approaches need to work at scale while still allowing for a clear mathematical understanding of the method's inner workings. Although inherently interpretable semi-parametric regression techniques exist that account for non-linearity in the data and scale to large applications, their model complexity is often still restricted. One of the main limitations is the absence of interactions in these models, which are omitted for the sake of better interpretability, but also because of untenable computational costs. To address this shortcoming, we derive a scalable high-order tensor product spline model using a factorization approach. Our method allows the inclusion of all (higher-order) interactions of non-linear feature effects at a computational cost proportional to that of a model without interactions. We show both theoretically and empirically that our method scales notably better than existing approaches, derive meaningful penalization schemes, and discuss further theoretical aspects. Finally, we investigate predictive and estimation performance on both synthetic and real data.
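To illustrate how a factorization can bring the cost of interaction terms down to that of a model without interactions, the following minimal sketch uses the well-known factorization-machine identity for pairwise interactions. It is an illustration only and not necessarily the construction derived in the paper. The names `B` (a matrix whose columns stand in for basis-evaluated features), `V` (latent factors of rank `r`), and both function names are hypothetical and chosen for this example.

```python
import numpy as np

def pairwise_naive(B, V):
    """Sum over all pairs j < k of <V[j], V[k]> * B[:, j] * B[:, k].

    Cost grows quadratically in the number of columns p.
    """
    n, p = B.shape
    out = np.zeros(n)
    for j in range(p):
        for k in range(j + 1, p):
            out += (V[j] @ V[k]) * B[:, j] * B[:, k]
    return out

def pairwise_factorized(B, V):
    """Same quantity at cost linear in p, via the identity
    sum_{j<k} <v_j, v_k> b_j b_k
        = 0.5 * sum_f [(sum_j v_jf b_j)^2 - sum_j v_jf^2 b_j^2].
    """
    s = B @ V                   # shape (n, r): sum_j b_j * v_jf
    s2 = (B ** 2) @ (V ** 2)    # shape (n, r): sum_j b_j^2 * v_jf^2
    return 0.5 * (s ** 2 - s2).sum(axis=1)

# Hypothetical toy data: 5 observations, 8 columns, rank-3 factors.
rng = np.random.default_rng(0)
B = rng.normal(size=(5, 8))
V = rng.normal(size=(8, 3))
assert np.allclose(pairwise_naive(B, V), pairwise_factorized(B, V))
```

Both functions return the same values, but the factorized version needs only two matrix products, so its cost is linear rather than quadratic in the number of columns.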