Kitana: Efficient Data Augmentation Search for AutoML

AutoML services let non-expert users benefit from high-quality ML models without worrying about model design and deployment, in exchange for an hourly charge ($21.252 for VertexAI). However, existing AutoML services are model-centric: they are limited to extracting features and searching for models over the initial training data, so they are only as effective as the quality of that data. With the increasing volume of tabular data available, there is a huge opportunity for data augmentation. For instance, vertical augmentation adds predictive features, while horizontal augmentation adds examples. The augmented training data can yield much better AutoML models at a lower cost. However, existing systems either forgo these augmentation opportunities and so produce poor models, or apply expensive augmentation search techniques that drain users' budgets. Kitana is a data-centric AutoML system that also searches for new tabular datasets that can augment the tabular training data with new features and/or examples. Kitana manages a corpus of datasets, exposes an AutoML interface to users, and searches for augmentations using datasets in the corpus to improve AutoML performance. To accelerate search, Kitana applies aggressive pre-computation to train a factorized proxy model and evaluate each candidate augmentation within 0.1s. Kitana also uses a cost model to limit the time spent on augmentation search, supports expressive data access controls, and performs request caching to benefit from past similar requests. Using a corpus of 518 open-source datasets, Kitana produces higher quality models than existing AutoML systems in orders of magnitude less time. Across different user requests, Kitana increases the model R² from 0.16 to 0.66 while reducing the cost by >100x compared to naive factorized learning and state-of-the-art data augmentation search.
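As a rough illustration of scoring vertical augmentations with a cheap proxy model, the sketch below fits a linear regression from aggregate statistics only (the Gram matrix X^T X, the vector X^T y, and sums over y) and ranks two candidate feature tables by R². It is not Kitana's implementation: the data is synthetic, all names (`useful`, `useless`, `proxy_r2`, `design`) are hypothetical, and the join is materialized for simplicity, whereas factorized learning is meant to avoid that step.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_keys = 1000, 50

# Synthetic user training data: one feature x and a join key per row.
key = rng.integers(0, n_keys, size=n)
x = rng.normal(size=n)

# Hypothetical corpus tables, each holding one feature value per join key.
corpus = {
    "useful": rng.normal(size=n_keys),   # actually drives the target
    "useless": rng.normal(size=n_keys),  # unrelated to the target
}
y = 0.5 * x + 2.0 * corpus["useful"][key] + rng.normal(scale=0.5, size=n)


def design(*cols):
    """Stack an intercept column with the given feature columns."""
    return np.column_stack([np.ones(n), *cols])


def proxy_r2(X, y):
    """R² of a least-squares fit, computed from aggregates only."""
    gram, xty, yty = X.T @ X, X.T @ y, y @ y
    theta = np.linalg.solve(gram, xty)
    sse = yty - 2 * theta @ xty + theta @ gram @ theta  # residual sum of squares
    sst = yty - y.sum() ** 2 / len(y)                   # total sum of squares
    return 1 - sse / sst


base = proxy_r2(design(x), y)
# Vertical augmentation: look up each candidate's feature through the join key.
scores = {name: proxy_r2(design(x, table[key]), y) for name, table in corpus.items()}
best = max(scores, key=scores.get)
print(f"base R2={base:.2f}, best augmentation={best}, R2={scores[best]:.2f}")
```

Because the score depends only on a few small aggregates, those aggregates could in principle be pre-computed per corpus dataset, which is the intuition behind evaluating each candidate quickly.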
