In recent years, topic modeling has emerged as a powerful technique for organizing and summarizing large collections of documents and for searching them for particular patterns. However, privacy concerns may arise when data from different sources are analyzed jointly. Federated topic modeling addresses this issue by allowing multiple parties to jointly train a topic model without sharing their data. While several federated approximations of classical topic models exist, no research has been conducted on their application to neural topic models. To fill this gap, we propose and analyze a federated implementation based on state-of-the-art neural topic modeling implementations, and show its benefits when topics are diverse across the nodes' documents and a joint model is needed. In practice, our approach is equivalent to centralized model training, but preserves the privacy of the nodes. The advantages of this federated scenario are illustrated through experiments on both synthetic and real data.
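The claim that federated training can match centralized training can be illustrated with a generic gradient-averaging sketch. This is not the protocol of the proposed method, which the abstract does not specify. It assumes a linear model with a mean squared error loss as a stand-in for a neural topic model, three hypothetical nodes, and a server that averages node gradients weighted by local sample count. All function and variable names are illustrative.

```python
import numpy as np

def local_gradient(weights, X, y):
    """Gradient of the mean squared error of a linear model on one node's data."""
    residual = X @ weights - y
    return 2.0 * X.T @ residual / len(y)

def federated_gradient(weights, shards):
    """Server-side aggregate: sample-weighted average of the nodes' gradients."""
    total = sum(len(y) for _, y in shards)
    return sum(len(y) / total * local_gradient(weights, X, y) for X, y in shards)

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 5))
y = rng.normal(size=90)
weights = rng.normal(size=5)

# Three hypothetical nodes holding unequal amounts of private data.
shards = [(X[:20], y[:20]), (X[20:50], y[20:50]), (X[50:], y[50:])]

# Only gradients leave the nodes; the raw (X, y) pairs stay local.
g_fed = federated_gradient(weights, shards)

# Reference: the gradient a central server would compute on the pooled data.
g_central = local_gradient(weights, X, y)

gradients_match = bool(np.allclose(g_fed, g_central))
```

Under these assumptions the two gradients coincide up to floating-point error, because the pooled loss is a sample-weighted average of the per-node losses. Each synchronized update step therefore matches the centralized one.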