VertiBayes: Learning Bayesian network parameters from vertically partitioned data with missing values

From arXiv. An updated version of this paper has been accepted for publication in Springer Complex & Intelligent Systems. The final version can be found here: https://link.springer.com/article/10.1007/s40747-024-01424-0

Federated learning makes it possible to train a machine learning model on decentralized data. Bayesian networks are probabilistic graphical models that have been widely used in artificial intelligence applications. Their popularity stems from the fact that they can be built by combining existing expert knowledge with data and are highly interpretable, which makes them useful for decision support, e.g. in healthcare. While some research has been published on the federated learning of Bayesian networks, publications on Bayesian networks in a vertically partitioned or heterogeneous data setting (where different variables are located in different datasets) are limited and suffer from important omissions, such as the handling of missing data. In this article, we propose a novel method called VertiBayes to train Bayesian networks (structure and parameters) on vertically partitioned data, which can handle missing values as well as an arbitrary number of parties. For structure learning, we adapted the widely used K2 algorithm with a privacy-preserving scalar product protocol. For parameter learning, we use a two-step approach: first, we learn an intermediate model using maximum likelihood by treating missing values as a special value; then, we train a model on synthetic data generated by the intermediate model using the EM algorithm. The privacy guarantees of our approach are equivalent to the ones provided by the privacy-preserving scalar product protocol used. We experimentally show that our approach produces models comparable to those learnt using traditional algorithms, and we estimate how the computational complexity increases with the number of samples, the network size, and the network complexity. Finally, we propose two alternative approaches to estimate the performance of the model using vertically partitioned data, and we show in experiments that they lead to reasonably accurate estimates.
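To make the first parameter-learning step more concrete, the following minimal sketch shows one way the counts behind a maximum likelihood estimate can be written as scalar products of 0/1 indicator vectors, with missing values kept as an extra state. It is an illustration under stated assumptions, not the authors' implementation: the names `indicator`, `joint_count`, `ml_cpt`, the `MISSING` marker, and the toy columns are all hypothetical, and the scalar product is computed in plaintext here, whereas the method described above would obtain the same sum through a privacy-preserving scalar product protocol.

```python
from math import prod

MISSING = "?"  # assumption: missing values are encoded as one extra state


def indicator(column, value):
    """Return a 0/1 vector marking the records where `column` equals `value`."""
    return [1 if v == value else 0 for v in column]


def joint_count(*indicators):
    """Count the records for which every indicator is 1.

    Plaintext stand-in for an n-party scalar product: the sum over records
    of the element-wise product of the indicator vectors. This sketch offers
    no privacy; each vector would stay with its owning party in practice.
    """
    return sum(prod(bits) for bits in zip(*indicators))


def ml_cpt(child, parent):
    """Maximum likelihood estimate of P(child | parent), MISSING kept as a state."""
    child_states = sorted(set(child))
    parent_states = sorted(set(parent))
    cpt = {}
    for p in parent_states:
        p_ind = indicator(parent, p)
        total = joint_count(p_ind)  # records with parent == p (never zero here)
        for c in child_states:
            cpt[(c, p)] = joint_count(indicator(child, c), p_ind) / total
    return cpt


# Hypothetical toy data: party A holds `smoker`, party B holds `cancer`.
# Rows are assumed to be aligned on the same records.
smoker = ["yes", "yes", "no", "no", MISSING, "yes"]
cancer = ["yes", "no", "no", "no", "no", MISSING]

cpt = ml_cpt(child=cancer, parent=smoker)
for (c, p), prob in sorted(cpt.items()):
    print(f"P(cancer={c} | smoker={p}) = {prob:.3f}")
```

Because every entry of the conditional probability table reduces to a ratio of two such counts, the parties only ever need the result of the scalar product, not each other's columns. The second step (sampling synthetic data from this intermediate model and running EM on it) is not shown here.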
