Current automated machine learning (ML) tools are model-centric, focusing on model selection and parameter optimization. However, most of the time in data analysis is devoted to data cleaning and wrangling, for which few tools are available. Here we present DataAssist, an automated data preparation and cleaning platform that enhances dataset quality using ML-informed methods. We show that DataAssist provides a pipeline for exploratory data analysis and data cleaning, including generating visualizations for user-selected variables, unifying data annotation, suggesting anomaly removal, and preprocessing data. The exported dataset can be readily integrated with other autoML tools or user-specified models for downstream analysis. Our data-centric tool is applicable to a variety of fields, including economics, business, and forecasting applications, and saves over 50% of the time spent on data cleansing and preparation.
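The abstract names the pipeline stages but not the methods behind them, so the following is only a minimal sketch of what three of those stages (unifying annotation, suggesting anomalies, and preprocessing) might look like. Everything in it is an assumption made for illustration: the function names `unify_label`, `suggest_anomalies`, and `prepare` are hypothetical, the anomaly rule is a plain Tukey interquartile-range fence rather than an ML-informed method, and missing values are filled with a median. None of it is taken from DataAssist itself.

```python
import statistics


def unify_label(value):
    """Collapse case and whitespace variants of one categorical label."""
    return " ".join(value.lower().split())


def suggest_anomalies(values, k=1.5):
    """Return indices of values outside the Tukey fences Q1 - k*IQR and Q3 + k*IQR."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4, method="inclusive")
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [i for i, v in enumerate(values) if v is not None and not (low <= v <= high)]


def prepare(rows, label_key, numeric_key):
    """Unify labels, flag candidate anomalies, and impute missing numbers.

    Flagged rows are only reported, not dropped, so the user keeps the final say.
    """
    values = [row[numeric_key] for row in rows]
    flagged = suggest_anomalies(values)
    # Median of the values that are present and were not flagged as anomalous.
    typical = statistics.median(
        v for i, v in enumerate(values) if v is not None and i not in flagged
    )
    cleaned = []
    for row in rows:
        new_row = dict(row)  # copy so the caller's rows are left untouched
        new_row[label_key] = unify_label(row[label_key])
        if new_row[numeric_key] is None:
            new_row[numeric_key] = typical
        cleaned.append(new_row)
    return cleaned, flagged


if __name__ == "__main__":
    # Hypothetical toy data: inconsistent labels, one gap, one extreme value.
    sample = [
        {"region": " North", "sales": 10.0},
        {"region": "north ", "sales": 12.0},
        {"region": "SOUTH", "sales": None},
        {"region": "South", "sales": 11.0},
        {"region": "south", "sales": 13.0},
        {"region": "North", "sales": 500.0},
    ]
    cleaned_rows, suggestions = prepare(sample, "region", "sales")
    print(suggestions)
    print(cleaned_rows)
```

Returning the flagged indices instead of deleting the rows mirrors the abstract's wording that anomaly removal is suggested rather than applied automatically. The cleaned rows are plain dictionaries, so they could be passed on to any downstream modelling tool.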