The machine learning (ML) life cycle involves a series of iterative steps, from the effective gathering and preparation of the data, including complex feature engineering processes, to the presentation and improvement of results, with various algorithms to choose from at every step. Feature engineering in particular can be very beneficial for ML, leading to improvements such as better predictive performance, shorter computation times, less noise, and greater transparency behind the decisions made during training. Yet while several visual analytics tools exist to monitor and control the different stages of the ML life cycle (especially those related to data and algorithms), support for feature engineering remains inadequate. In this paper, we present FeatureEnVi, a visual analytics system specifically designed to assist with the feature engineering process. Our proposed system helps users to select the most important features, to transform the original features into more powerful alternatives, and to experiment with different combinations of generated features. Additionally, data space slicing allows users to explore the impact of features on both local and global scales. FeatureEnVi utilizes multiple automatic feature selection techniques; furthermore, it visually guides users with statistical evidence about the influence of each feature (or subset of features). The final outcome is a set of heavily engineered features, evaluated with multiple validation metrics. The usefulness and applicability of FeatureEnVi are demonstrated with two use cases and a case study. We also report feedback from interviews with two ML experts and a visualization researcher who assessed the effectiveness of our system.
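The three activities named in the abstract (feature selection, feature transformation, and feature generation) can be illustrated with a minimal sketch. It is not FeatureEnVi's implementation: it assumes absolute Pearson correlation with the target as a stand-in for the statistical evidence the system provides, and the helper names (`pearson`, `candidate_features`, `rank_features`) and the toy data are hypothetical.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def candidate_features(columns):
    """Return the original features plus transformed and generated candidates."""
    out = dict(columns)
    # Transformation: log of every strictly positive feature.
    for name, vals in columns.items():
        if all(v > 0 for v in vals):
            out[f"log({name})"] = [math.log(v) for v in vals]
    # Generation: pairwise products of the original features.
    for (a, va), (b, vb) in combinations(columns.items(), 2):
        out[f"{a}*{b}"] = [x * y for x, y in zip(va, vb)]
    return out


def rank_features(columns, target):
    """Selection: sort features by absolute correlation with the target, strongest first."""
    scores = {name: abs(pearson(vals, target)) for name, vals in columns.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data, made up for illustration: the target is exactly width * height.
data = {
    "width": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "height": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
}
target = [w * h for w, h in zip(data["width"], data["height"])]

ranking = rank_features(candidate_features(data), target)
for name, score in ranking:
    print(f"{name:15s} {score:.3f}")
```

On this toy data the generated product feature ranks first, since it reproduces the target exactly. A real workflow would compare candidates with several selection techniques and validation metrics, as the abstract describes, rather than a single correlation score.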