Data privacy and ownership are significant issues in social data science, raising legal and ethical concerns. Sharing and analyzing data is difficult when different parties own different parts of it. One approach to this challenge is to apply de-identification or anonymization techniques to the data before collecting it for analysis. However, this can reduce data utility and increase the risk of re-identification. To address these limitations, we present PADME, a distributed analytics tool that federates model implementation and training. PADME uses a federated approach in which the model is implemented and deployed by all parties and visits each data location incrementally for training. This enables analysis across locations while still allowing the model to be trained as if all data were in a single location. Because the model is trained on data in its original location, data ownership is preserved. Furthermore, results are not released until the analysis has been completed at all data locations, to ensure privacy and avoid bias in the results.
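The incremental visiting scheme described above can be illustrated with a minimal sketch. It is not PADME's actual API: the names `Station`, `TravellingModel`, `visit_next`, and `run_demo` are hypothetical, the model is a toy one-dimensional linear regression trained by plain stochastic gradient descent, and the data are noise-free points on y = 2x + 1 chosen only so that the example converges. The sketch shows the two properties the text names: records are only read at their own location, and the result is withheld until every location has been visited.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # one (x, y) observation


class Station:
    """Hypothetical data location; records are only read inside this class."""

    def __init__(self, name: str, records: List[Sample]) -> None:
        self.name = name
        self._records = records

    def train_locally(self, w: float, b: float,
                      lr: float = 0.1, epochs: int = 200) -> Tuple[float, float]:
        """Continue SGD on y = w*x + b using only this station's records."""
        for _ in range(epochs):
            for x, y in self._records:
                err = (w * x + b) - y
                w -= lr * err * x
                b -= lr * err
        return w, b


class TravellingModel:
    """Hypothetical model that moves from station to station (assumed design)."""

    def __init__(self, route: List[Station]) -> None:
        self._route = route
        self._w = 0.0
        self._b = 0.0
        self._visited = 0

    def finished(self) -> bool:
        return self._visited == len(self._route)

    def visit_next(self) -> None:
        """Train at the next station on the route; only parameters move on."""
        if self.finished():
            raise RuntimeError("all stations have already been visited")
        station = self._route[self._visited]
        self._w, self._b = station.train_locally(self._w, self._b)
        self._visited += 1

    def result(self) -> Tuple[float, float]:
        """Release parameters only after the whole route is complete."""
        if not self.finished():
            raise RuntimeError("results are withheld until every station is visited")
        return self._w, self._b


def run_demo() -> Tuple[float, float]:
    # Toy, noise-free data on y = 2x + 1, split across three owners.
    stations = [
        Station("A", [(0.0, 1.0), (1.0, 3.0)]),
        Station("B", [(2.0, 5.0), (3.0, 7.0)]),
        Station("C", [(-1.0, -1.0), (1.5, 4.0)]),
    ]
    model = TravellingModel(stations)
    while not model.finished():
        model.visit_next()
    return model.result()


if __name__ == "__main__":
    print(run_demo())
```

In this toy setting every station's data are consistent with the same line, so a single pass over the route is enough. With real, heterogeneous data, sequential training may need several passes over the route and is not guaranteed to match centralized training exactly.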