Data is a central component of machine learning and causal inference tasks. The availability of large amounts of data from sources such as open data repositories, data lakes, and data marketplaces creates an opportunity to augment data and boost the performance of those tasks. However, existing augmentation techniques rely on a user manually discovering and shortlisting useful candidate augmentations. Existing solutions do not leverage the synergy between discovery and augmentation, and so leave the available data underexploited. In this paper, we introduce METAM, a novel goal-oriented framework that queries the downstream task with a candidate dataset, forming a feedback loop that automatically steers the discovery and augmentation process. To select candidates efficiently, METAM leverages properties of (i) the data, (ii) the utility function, and (iii) the solution set size. We show METAM's theoretical guarantees and demonstrate them empirically on a broad set of tasks. All in all, we demonstrate the promise of goal-oriented data discovery for modern data science applications.
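To make the feedback loop concrete, the following is a minimal sketch of goal-oriented candidate selection. It is not METAM's algorithm: it is a plain greedy loop, and it does not use the data, utility-function, or solution-size properties the abstract mentions. The function name, the `utility` callback (standing in for a query to the downstream task), the `target` threshold, and the toy gains are all hypothetical and chosen only for illustration.

```python
from typing import Callable, FrozenSet, Iterable, Tuple


def greedy_goal_oriented_augmentation(
    candidates: Iterable[str],
    utility: Callable[[FrozenSet[str]], float],
    target: float,
    max_size: int,
) -> Tuple[FrozenSet[str], float]:
    """Greedily add the candidate that most improves the task utility.

    `utility` is an assumed stand-in for running the downstream task on
    the base data plus the selected augmentations and returning a score.
    """
    selected: FrozenSet[str] = frozenset()
    best = utility(selected)  # score of the unaugmented base data
    remaining = set(candidates)

    while remaining and len(selected) < max_size and best < target:
        # Query the downstream task once per remaining candidate.
        scored = [(utility(selected | {c}), c) for c in sorted(remaining)]
        top_score, top_candidate = max(scored)
        if top_score <= best:
            break  # no candidate improves the task, so stop early
        selected = selected | {top_candidate}
        remaining.discard(top_candidate)
        best = top_score

    return selected, best


# Hypothetical per-candidate gains; a real utility would train and
# evaluate a model (or run a causal analysis) on the augmented data.
TOY_GAINS = {"weather": 0.10, "census": 0.05, "noise": 0.0}


def toy_utility(selected: FrozenSet[str]) -> float:
    return 0.6 + sum(TOY_GAINS[c] for c in selected)


chosen, score = greedy_goal_oriented_augmentation(
    TOY_GAINS, toy_utility, target=0.72, max_size=3
)
```

In this toy run the loop picks `weather` first, then `census`, and stops once the score passes the target, so `noise` is never added. Each iteration costs one task query per remaining candidate, which is the expense that a more selective strategy would aim to reduce.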