martFL: Enabling Utility-Driven Data Marketplace with a Robust and Verifiable Federated Learning Architecture

The development of machine learning models requires a large amount of training data. Data marketplaces are essential for trading high-quality, private-domain data that is not publicly available online. However, due to growing data privacy concerns, direct data exchange is inappropriate. Federated Learning (FL) is a distributed machine learning paradigm that exchanges data utilities (in the form of local models or gradients) among multiple parties without directly sharing the raw data. However, several challenges arise when applying existing FL architectures to construct a data marketplace: (i) in existing FL architectures, Data Acquirers (DAs) cannot privately evaluate local models from Data Providers (DPs) prior to trading; (ii) model aggregation protocols in existing FL designs struggle to exclude malicious DPs without "overfitting" to the DA's (possibly biased) root dataset; (iii) prior FL designs lack a proper billing mechanism that compels the DA to allocate rewards fairly according to the contributions made by different DPs. To address these challenges, we propose martFL, the first federated learning architecture specifically designed to enable a secure utility-driven data marketplace. At a high level, martFL is powered by two innovative designs: (i) a quality-aware model aggregation protocol that achieves robust local model aggregation even when the DA's root dataset is biased; (ii) a verifiable data transaction protocol that enables the DA to prove, both succinctly and in zero-knowledge, that it has faithfully aggregated the local models submitted by different DPs according to the committed aggregation weights, based on which the DPs can unambiguously claim the corresponding rewards. We implement a prototype of martFL and evaluate it extensively over various tasks. The results show that martFL can improve model accuracy by up to 25% while reducing data acquisition cost by up to 64%.
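To make the notions of "committed aggregation weights" and contribution-based rewards concrete, the following is a minimal sketch and not martFL's actual protocol. It assumes local models are flat lists of floats, replaces the zero-knowledge proof with a plain salted SHA-256 commitment, and splits a budget in proportion to the weights. All function names (`commit_weights`, `aggregate`, `allocate_rewards`) and the example values are hypothetical.

```python
import hashlib
import json


def commit_weights(weights, salt):
    """Salted hash commitment to the aggregation weights.

    Assumption: a simple stand-in for illustration only. It is binding to
    the weights but is NOT a succinct zero-knowledge proof.
    """
    payload = json.dumps({"weights": weights, "salt": salt}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def aggregate(local_updates, weights):
    """Weighted average of local updates (equal-length lists of floats)."""
    total = sum(weights)
    dim = len(local_updates[0])
    return [
        sum(w * update[i] for w, update in zip(weights, local_updates)) / total
        for i in range(dim)
    ]


def allocate_rewards(weights, budget):
    """Split a budget among DPs in proportion to their aggregation weights."""
    total = sum(weights)
    return [budget * w / total for w in weights]


# Three hypothetical DPs; the third is treated as low quality and gets weight 0.
updates = [[1.0, 2.0], [3.0, 4.0], [100.0, -100.0]]
weights = [1.0, 1.0, 0.0]

commitment = commit_weights(weights, salt="example-salt")  # published first
global_update = aggregate(updates, weights)
rewards = allocate_rewards(weights, budget=10.0)

# Anyone who later learns the weights and salt can recheck the commitment.
assert commit_weights(weights, "example-salt") == commitment
```

In this toy setting a DP with weight zero neither influences the aggregate nor receives a reward, which is the intuition behind tying payment to the committed weights. The plain commitment reveals the weights when opened, whereas the abstract describes a proof that is succinct and zero-knowledge.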
