martFL: Enabling Utility-Driven Data Marketplace with a Robust and Verifiable Federated Learning Architecture

The development of machine learning models requires a large amount of training data. Data marketplaces are essential for trading high-quality, private-domain data that is not publicly available online. However, due to growing data privacy concerns, direct data exchange is inappropriate. Federated Learning (FL) is a distributed machine learning paradigm in which multiple parties exchange data utilities (in the form of local models or gradients) without directly sharing the raw data. However, several challenges arise when applying existing FL architectures to construct a data marketplace: (i) in existing FL architectures, Data Acquirers (DAs) cannot privately evaluate local models from Data Providers (DPs) prior to trading; (ii) model aggregation protocols in existing FL designs struggle to exclude malicious DPs without "overfitting" to the DA's (possibly biased) root dataset; (iii) prior FL designs lack a proper billing mechanism to compel the DA to allocate the reward fairly according to the contributions made by different DPs. To address these challenges, we propose martFL, the first federated learning architecture specifically designed to enable a secure utility-driven data marketplace. At a high level, martFL is powered by two innovative designs: (i) a quality-aware model aggregation protocol that achieves robust local model aggregation even when the DA's root dataset is biased; (ii) a verifiable data transaction protocol that enables the DA to prove, both succinctly and in zero-knowledge, that it has faithfully aggregated the local models submitted by different DPs according to the committed aggregation weights, based on which the DPs can unambiguously claim the corresponding reward. We implement a prototype of martFL and evaluate it extensively over various tasks. The results show that martFL can improve model accuracy by up to 25% while reducing data acquisition cost by up to 64%.
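
To make the link between aggregation weights and rewards concrete, the following is a minimal sketch, not martFL's actual protocol. It assumes that each DP submits a flat list of floats as its local update, that the DA has already fixed one non-negative weight per DP, and that the reward budget is split in proportion to those weights. The function names `aggregate` and `split_reward` are hypothetical, the proportional split is an illustrative assumption, and the sketch omits both the quality-aware weight computation and the zero-knowledge proof.

```python
from typing import Dict, List


def aggregate(local_updates: Dict[str, List[float]],
              weights: Dict[str, float]) -> List[float]:
    """Weighted average of local updates (hypothetical helper).

    Weights are normalized to sum to 1; a DP with weight 0 is
    effectively excluded from the aggregated update.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one DP must have a positive weight")
    dim = len(next(iter(local_updates.values())))
    aggregated = [0.0] * dim
    for dp, update in local_updates.items():
        share = weights.get(dp, 0.0) / total
        for i, value in enumerate(update):
            aggregated[i] += share * value
    return aggregated


def split_reward(weights: Dict[str, float], budget: float) -> Dict[str, float]:
    """Split a budget in proportion to the weights (illustrative assumption)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one DP must have a positive weight")
    return {dp: budget * w / total for dp, w in weights.items()}


# Toy data: dp3 is treated as low quality and receives weight 0.
updates = {
    "dp1": [1.0, 2.0],
    "dp2": [3.0, 4.0],
    "dp3": [100.0, -100.0],
}
weights = {"dp1": 1.0, "dp2": 3.0, "dp3": 0.0}

global_update = aggregate(updates, weights)
rewards = split_reward(weights, budget=100.0)
```

In this toy run the outlier `dp3` contributes nothing to the aggregate and earns no reward, while `dp1` and `dp2` share the budget 1:3. In the architecture described above, the DA would additionally commit to the weights and prove in zero-knowledge that the aggregation used them; that step is outside the scope of this sketch.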
