Consider two data providers that want to contribute data to a learning model. Recent work has shown that the value of one provider's data depends on its similarity to the data owned by the other provider. It would thus be beneficial if the two providers could calculate the similarity of their data while keeping the actual data private. In this work, we devise multiparty computation protocols to compute the similarity of two data sets based on correlation, while offering controllable privacy guarantees. We consider a simple model with two participating providers and develop two protocols, one computing the exact correlation and one computing an approximate correlation, each with controlled information leakage. Both protocols have computational and communication complexities that are linear in the number of data samples. We also provide general bounds on the maximal error in the approximate case, and analyse the resulting errors for practical parameter choices.
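To make the linear-complexity claim concrete, the following plaintext sketch shows why correlation suits a two-party setting: each provider can standardize its own samples locally, and the only joint quantity is a single inner product over the n samples. This is not the protocol developed in this work and it provides no privacy; it only illustrates the decomposition that a secure protocol would have to evaluate. All function names are hypothetical.

```python
import math


def standardize(values):
    """Local step: center the samples and scale them to unit Euclidean norm."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    norm = math.sqrt(sum(c * c for c in centered))
    if norm == 0:
        raise ValueError("correlation is undefined for constant data")
    return [c / norm for c in centered]


def correlation_from_standardized(u, v):
    """Joint step: the Pearson correlation is the inner product of the
    two standardized vectors, a single pass over the n samples."""
    if len(u) != len(v):
        raise ValueError("both providers must hold the same number of samples")
    return sum(a * b for a, b in zip(u, v))


def pearson_direct(x, y):
    """Textbook Pearson correlation on the pooled data, for comparison only."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Illustrative samples, invented for this sketch.
    x = [1.0, 2.0, 3.0, 4.0, 5.0]   # held by provider A
    y = [2.1, 3.9, 6.2, 7.8, 10.5]  # held by provider B
    u = standardize(x)              # computed by A alone
    v = standardize(y)              # computed by B alone
    print(correlation_from_standardized(u, v))
    print(pearson_direct(x, y))
```

In a private variant, only the inner product in `correlation_from_standardized` would need to be computed jointly, which is consistent with costs that grow linearly in the number of samples.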