Dependent Dirichlet processes via thinning

When analyzing data from multiple sources, it is often convenient to strike a careful balance between two goals: capturing the heterogeneity of the samples and sharing information across them. We introduce a novel framework to model a collection of samples using dependent Dirichlet processes constructed through a thinning mechanism. The proposed approach modifies the stick-breaking representation of the Dirichlet process by thinning, that is, setting equal to zero a random subset of the beta random variables used in the original construction. This results in a collection of dependent random distributions that exhibit both shared and unique atoms, with the shared ones assigned distinct weights in each distribution. The generality of the construction allows expressing a wide variety of dependence structures among the elements of the generated random vectors. Moreover, its simplicity facilitates the characterization of several theoretical properties and the derivation of efficient computational methods for posterior inference. A simulation study illustrates how a modeling approach based on the proposed process reduces uncertainty in group-specific inferences while preventing excessive borrowing of information when the data indicate it is unnecessary. This added flexibility improves the accuracy of posterior inference, outperforming related state-of-the-art models. An application to the Collaborative Perinatal Project data highlights the model's capability to estimate group-specific densities and uncover a meaningful partition of the observations, both within and across samples, providing valuable insights into the underlying data structure.
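The thinning mechanism described above can be illustrated with a small simulation. The sketch below is not the authors' implementation: it assumes a truncation to a finite number of atoms, a single set of beta variables and atoms shared by all groups, independent Bernoulli thinning indicators with a common retention probability, and a standard normal base measure. None of these choices is specified in the abstract, and the names `thinned_stick_breaking`, `alpha` and `keep_prob` are hypothetical.

```python
import numpy as np


def thinned_stick_breaking(n_groups, n_atoms, alpha=1.0, keep_prob=0.5, seed=0):
    """Draw truncated, thinned stick-breaking weights for several groups.

    Assumptions (not taken from the abstract): shared Beta(1, alpha) sticks,
    i.i.d. Bernoulli(keep_prob) thinning, N(0, 1) base measure, truncation
    at n_atoms atoms.
    """
    rng = np.random.default_rng(seed)

    # Atoms and beta variables of the usual stick-breaking construction.
    atoms = rng.normal(size=n_atoms)
    sticks = rng.beta(1.0, alpha, size=n_atoms)

    # Thinning: in each group, a random subset of the beta variables is set to zero.
    keep = rng.random((n_groups, n_atoms)) < keep_prob
    thinned = np.where(keep, sticks, 0.0)

    # Stick length left before atom k: prod_{l < k} (1 - thinned_l).
    remaining = np.cumprod(1.0 - thinned, axis=1)
    remaining = np.concatenate(
        [np.ones((n_groups, 1)), remaining[:, :-1]], axis=1
    )

    # Weight of atom k in each group; thinned atoms get weight exactly zero.
    weights = thinned * remaining
    return atoms, weights, keep


atoms, weights, keep = thinned_stick_breaking(n_groups=3, n_atoms=50)
shared = keep.sum(axis=0) > 1  # atoms retained by at least two groups
print("atoms shared by two or more groups:", int(shared.sum()))
print("mass left unassigned by truncation:", 1.0 - weights.sum(axis=1))
```

In this sketch, an atom retained by several groups generally receives a different weight in each, because the stick length remaining before it depends on which earlier beta variables were thinned in that group. An atom retained by only one group is unique to it.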
