Machine learning algorithms have become increasingly prevalent in domains such as autonomous driving, healthcare, and finance. In these domains, data preparation remains a major obstacle to developing accurate models: it demands considerable expertise and time to search the huge space of suitable data curation and transformation tools. To address this challenge, we present AutoCure, a novel, configuration-free data curation pipeline that improves the quality of tabular data. Unlike traditional data curation methods, AutoCure synthetically enhances the density of the clean data fraction through an adaptive ensemble-based error detection method and a data augmentation module. In practice, AutoCure can be integrated with open-source tools, e.g., Auto-sklearn, H2O, and TPOT, to promote the democratization of machine learning. As a proof of concept, we provide a comparative evaluation of AutoCure against 28 combinations of traditional data curation tools, demonstrating superior performance and predictive accuracy without user intervention. Our evaluation shows that AutoCure is an effective approach to automating data preparation and improving the accuracy of machine learning models.