martFL: Enabling Utility-Driven Data Marketplace with a Robust and Verifiable Federated Learning Architecture

The development of machine learning models requires a large amount of training data. Data marketplaces are essential for trading high-quality, private-domain data that is not publicly available online. However, due to growing data privacy concerns, direct data exchange is inappropriate. Federated Learning (FL) is a distributed machine learning paradigm that exchanges data utilities (in the form of local models or gradients) among multiple parties without directly sharing the raw data. However, several challenges arise when applying existing FL architectures to construct a data marketplace: (i) in existing FL architectures, Data Acquirers (DAs) cannot privately evaluate local models from Data Providers (DPs) prior to trading; (ii) model aggregation protocols in existing FL designs struggle to exclude malicious DPs without "overfitting" to the DA's (possibly biased) root dataset; (iii) prior FL designs lack a proper billing mechanism that compels the DA to allocate rewards fairly according to the contributions made by different DPs. To address the above challenges, we propose martFL, the first federated learning architecture that is specifically designed to enable a secure utility-driven data marketplace. At a high level, martFL is powered by two innovative designs: (i) a quality-aware model aggregation protocol that achieves robust local model aggregation even when the DA's root dataset is biased; (ii) a verifiable data transaction protocol that enables the DA to prove, both succinctly and in zero-knowledge, that it has faithfully aggregated the local models submitted by different DPs according to the committed aggregation weights, based on which the DPs can unambiguously claim the corresponding reward. We implement a prototype of martFL and evaluate it extensively over various tasks. The results show that martFL can improve model accuracy by up to 25% while reducing data acquisition cost by up to 64%.
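
To make the idea of aggregating "according to the committed aggregation weights" concrete, the following is a minimal sketch and not martFL's actual protocol. It assumes that local models are flat lists of floats, that the DA commits to the weights with a plain salted SHA-256 hash (a stand-in for the succinct zero-knowledge proof, which is not reproduced here), and that rewards are split in proportion to the weights. The function names, the nonce, and the proportional reward rule are all hypothetical choices made for illustration.

```python
import hashlib
import json


def commit_weights(weights, nonce):
    """Hash commitment to the aggregation weights.

    Assumption: a salted SHA-256 hash stands in for the zero-knowledge
    proof described in the abstract. It hides nothing once opened.
    """
    payload = json.dumps({"weights": weights, "nonce": nonce}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def verify_commitment(commitment, weights, nonce):
    """Check that the opened weights match the earlier commitment."""
    return commit_weights(weights, nonce) == commitment


def aggregate(updates, weights):
    """Weighted average of equally sized local model updates."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    dim = len(updates[0])
    return [
        sum(w * u[i] for w, u in zip(weights, updates)) / total
        for i in range(dim)
    ]


def allocate_rewards(weights, budget):
    """Split a budget in proportion to the weights (hypothetical rule)."""
    total = sum(weights)
    return [budget * w / total for w in weights]


# Three hypothetical DPs; the third submits an outlier and gets weight 0.
updates = [[1.0, 2.0], [3.0, 4.0], [100.0, -100.0]]
weights = [0.5, 0.5, 0.0]
nonce = "example-nonce"

commitment = commit_weights(weights, nonce)   # published before aggregation
global_update = aggregate(updates, weights)   # [2.0, 3.0]
rewards = allocate_rewards(weights, 10.0)     # [5.0, 5.0, 0.0]
```

Once the DA opens the commitment, any DP can call `verify_commitment` to confirm that the weights behind its reward are the ones fixed before aggregation. Unlike the protocol described in the abstract, this sketch reveals the weights in the clear and does not prove that the aggregation itself was computed correctly.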
