Feature engineering is critically important in Data Science. While every data scientist knows that rigorous data preparation is needed to obtain well-performing models, little of the literature formalizes its benefits. In this work, we present Statistically Enhanced Learning (SEL), a framework that formalizes existing feature engineering and extraction tasks in Machine Learning (ML). SEL differs from classical ML in that certain predictors are not directly observed but are obtained as statistical estimators. Our goal is to study SEL, establish a formalized framework, and illustrate its improved performance through simulations and applications to real-life use cases.
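To make the idea of a predictor "obtained as a statistical estimator" concrete, the following minimal sketch attaches a per-group sample mean as an extra covariate. It is an illustration under stated assumptions, not the authors' implementation: the function `add_estimated_feature`, the column names `team`, `home`, and `est_mean_score`, the toy data, and the choice of the sample mean as the estimator are all hypothetical.

```python
from statistics import mean


def add_estimated_feature(rows, history, key="team", name="est_mean_score"):
    """Append a statistical estimate (here, a sample mean) to each row as a new predictor."""
    # Estimate one parameter per group from auxiliary observations.
    # The sample mean is an assumption; any estimator could be swapped in.
    estimates = {group: mean(values) for group, values in history.items()}
    # The estimate is not directly observed; it is added next to the observed covariates.
    return [{**row, name: estimates[row[key]]} for row in rows]


# Hypothetical toy data: past scores per team, and the observed rows to enhance.
history = {"A": [2.0, 3.0, 1.0], "B": [0.0, 1.0, 1.0, 2.0]}
rows = [{"team": "A", "home": 1}, {"team": "B", "home": 0}]

enhanced = add_estimated_feature(rows, history)
```

Each row in `enhanced` keeps its observed covariates and gains one estimated covariate, which a downstream model could then use like any other predictor.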