Data fuels machine learning (ML): rich, high-quality training data is essential to its success. However, important challenges remain in transforming ML from a race among a few large corporations into an accessible technology that serves the data analysis needs of many ordinary users. One gap we observe is that many ML users could benefit from new data held by other data owners, while these data owners sit on piles of data without knowing who could benefit from it. This gap creates an opportunity to build an online market that automatically connects supply with demand. While online matching markets are prevalent (e.g., ride-hailing systems), designing a data-centric market for ML poses many unprecedented challenges. This paper develops new techniques to tackle two core challenges in designing such a market: (a) to efficiently match demand with supply, we design an algorithm that automatically discovers useful data for any ML task from a pool of thousands of datasets, achieving high-quality matching between ML models and data; (b) to encourage market participation by ML users without much ML expertise, we design a new pricing mechanism for selling data-augmented ML models. Furthermore, our market is designed to be API-compatible with existing online ML markets such as Vertex AI and SageMaker, making it easy to use while providing better results through joint data and model search. We envision that the synergy between our data and model discovery algorithm and our pricing mechanism will be an important step towards building a new data-centric online market that serves ML users effectively.