This paper proposes a distributed version of Determinantal Point Process (DPP) inference to enhance multi-source data diversification under limited communication bandwidth. DPP is a popular probabilistic approach that improves data diversity by enforcing repulsion among the elements of a selected subset. The well-studied Maximum A Posteriori (MAP) inference in DPP aims to identify the subset with the highest diversity as quantified by DPP. However, this approach presumes that all data samples are available at a single location, which hinders its applicability to real-world settings such as traffic datasets, where data samples are distributed across sources and communication between them is band-limited. Inspired by techniques used in Multiple-Input Multiple-Output (MIMO) communication systems, we propose a strategy for performing MAP inference among distributed sources. Specifically, we show that a lower bound of the diversity-maximized distributed sample selection problem can be treated as a power allocation problem in MIMO systems. A determinant-preserving sparse representation of the selected samples is used to precode samples at the local sources before they are processed by DPP. Our method does not require raw data exchange among sources; it needs only a band-limited feedback channel that sends lightweight diversity measures, analogous to Channel State Information (CSI) messages in MIMO systems, from the center to the data sources. The experiments show that our scalable approach can outperform baseline methods, including random selection, uninformed individual DPP with no feedback, and DPP with SVD-based feedback, in both i.i.d. and non-i.i.d. setups. Specifically, it achieves a log-difference diversity gain of 1 to 6 in the latent representations of the CIFAR-10, CIFAR-100, StanfordCars, and GTSRB datasets.
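To make the centralized baseline concrete, the following is a minimal sketch of greedy MAP inference for a DPP, where diversity is scored as the log-determinant of the kernel submatrix of the selected items. It is not the distributed method proposed in the paper: the function names `log_det_diversity` and `greedy_dpp_map`, the linear kernel built from feature vectors, and the small diagonal jitter are all illustrative assumptions, and the naive greedy loop is only one common approximation to MAP inference.

```python
import numpy as np


def log_det_diversity(L, subset):
    """Log-determinant of the kernel submatrix indexed by `subset`.

    Hypothetical helper: returns 0.0 for the empty set and -inf when the
    submatrix is numerically singular or not positive definite.
    """
    if len(subset) == 0:
        return 0.0
    idx = np.asarray(subset)
    sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
    return logdet if sign > 0 else -np.inf


def greedy_dpp_map(L, k):
    """Naive greedy approximation to DPP MAP inference.

    At each step, add the item that maximizes the log-determinant of the
    selected submatrix. This assumes every sample is available in one
    place, which is exactly the presumption the abstract points out.
    """
    selected = []
    remaining = list(range(L.shape[0]))
    for _ in range(k):
        best_item, best_score = None, -np.inf
        for i in remaining:
            score = log_det_diversity(L, selected + [i])
            if score > best_score:
                best_item, best_score = i, score
        if best_item is None:  # no candidate keeps the determinant positive
            break
        selected.append(best_item)
        remaining.remove(best_item)
    return selected


# Toy example (assumed data): items 0 and 1 are identical, item 2 is orthogonal.
features = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 0.9]])
L = features @ features.T + 1e-6 * np.eye(3)  # linear kernel plus jitter
chosen = greedy_dpp_map(L, k=2)
print(chosen, log_det_diversity(L, chosen))
```

In the toy example, the second pick skips the duplicate of the first item because adding it would drive the determinant toward zero, which is the repulsion behavior described above.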