martFL: Enabling Utility-Driven Data Marketplace with a Robust and Verifiable Federated Learning Architecture

The development of machine learning models requires large amounts of training data. Data marketplaces are essential for trading high-quality, private-domain data that is not publicly available online. However, due to growing data privacy concerns, direct data exchange is inappropriate. Federated Learning (FL) is a distributed machine learning paradigm that exchanges data utilities (in the form of local models or gradients) among multiple parties without directly sharing the raw data. However, several challenges arise when applying existing FL architectures to construct a data marketplace: (i) in existing FL architectures, Data Acquirers (DAs) cannot privately evaluate local models from Data Providers (DPs) prior to trading; (ii) model aggregation protocols in existing FL designs struggle to exclude malicious DPs without "overfitting" to the DA's (possibly biased) root dataset; (iii) prior FL designs lack a proper billing mechanism that compels the DA to allocate rewards fairly according to the contributions made by different DPs. To address the above challenges, we propose martFL, the first federated learning architecture that is specifically designed to enable a secure utility-driven data marketplace. At a high level, martFL is powered by two innovative designs: (i) a quality-aware model aggregation protocol that achieves robust local model aggregation even when the DA's root dataset is biased; (ii) a verifiable data transaction protocol that enables the DA to prove, both succinctly and in zero-knowledge, that it has faithfully aggregated the local models submitted by different DPs according to the committed aggregation weights, based on which the DPs can unambiguously claim the corresponding rewards. We implement a prototype of martFL and evaluate it extensively over various tasks. The results show that martFL can improve model accuracy by up to 25% while reducing data acquisition cost by up to 64%.
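
To make the relationship between committed aggregation weights, model aggregation, and reward allocation concrete, the following is a minimal Python sketch. It is not martFL's actual protocol: the function names are hypothetical, the weights are supplied by hand rather than computed by a quality-aware scoring step, and a plain SHA-256 hash commitment stands in for the succinct zero-knowledge proof described above (a hash commitment is neither succinct proof of correct aggregation nor zero-knowledge once opened). Rewards proportional to weights are likewise an assumption made for illustration.

```python
import hashlib
import json


def commit_weights(weights, nonce):
    """Return a SHA-256 hash commitment to the aggregation weights.

    Assumption: a simple hash commitment, used here only as a stand-in
    for the commitment scheme of the real system.
    """
    payload = json.dumps({"weights": weights, "nonce": nonce}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def verify_commitment(commitment, weights, nonce):
    """Check that the opened weights and nonce match the commitment."""
    return commit_weights(weights, nonce) == commitment


def aggregate(local_updates, weights):
    """Weighted sum of local updates (each a list of floats of equal length)."""
    dim = len(local_updates[0])
    return [
        sum(w * update[i] for w, update in zip(weights, local_updates))
        for i in range(dim)
    ]


def allocate_rewards(weights, budget):
    """Split the budget in proportion to the aggregation weights (assumed rule)."""
    total = sum(weights)
    return [budget * w / total for w in weights]


# Hypothetical round: two honest DPs and one outlier that receives weight 0.
updates = [[1.0, 2.0], [3.0, 4.0], [100.0, -100.0]]
weights = [0.5, 0.5, 0.0]
nonce = "round-1"

commitment = commit_weights(weights, nonce)   # published by the DA before aggregation
global_update = aggregate(updates, weights)   # DA aggregates with the committed weights
rewards = allocate_rewards(weights, 100.0)    # each DP's claimable share of the budget
```

In this sketch a DP that later learns the opened weights and nonce can call `verify_commitment` to check that the DA did not change the weights after committing, and can then compute its own reward share. The real design goes further by proving that the aggregation itself was performed with those weights, without revealing the local models.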
