Our goal is to develop a general strategy to decompose a random variable $X$ into multiple independent random variables, without sacrificing any information about unknown parameters. A recent paper showed that for some well-known natural exponential families, $X$ can be "thinned" into independent random variables $X^{(1)}, \ldots, X^{(K)}$, such that $X = \sum_{k=1}^K X^{(k)}$. These independent random variables can then be used for various model validation and inference tasks, including in contexts where traditional sample splitting fails. In this paper, we generalize their procedure by relaxing this summation requirement and simply asking that some known function of the independent random variables exactly reconstruct $X$. This generalization of the procedure serves two purposes. First, it greatly expands the families of distributions for which thinning can be performed. Second, it unifies sample splitting and data thinning, which on the surface seem to be very different, as applications of the same principle. This shared principle is sufficiency. We use this insight to perform generalized thinning operations for a diverse set of families.
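To make the additive case concrete, here is a minimal sketch of the classical Poisson example, in which $X \sim \mathrm{Poisson}(\lambda)$ is split into $X^{(1)} \mid X \sim \mathrm{Binomial}(X, \epsilon)$ and $X^{(2)} = X - X^{(1)}$. It is a standard fact that the two parts are then independent, with $X^{(1)} \sim \mathrm{Poisson}(\epsilon\lambda)$ and $X^{(2)} \sim \mathrm{Poisson}((1-\epsilon)\lambda)$. The function names (`sample_poisson`, `thin_poisson`, `correlation`), the tuning parameter `eps`, and the choice of $K = 2$ are assumptions made for illustration. This is not the generalized procedure the abstract proposes, only the summation-based special case it starts from.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate using Knuth's method (adequate for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def thin_poisson(x, eps, rng):
    """Split a count x into (x1, x2) with x1 | x ~ Binomial(x, eps) and x2 = x - x1.

    Hypothetical helper: eps in (0, 1) controls how much of x goes to x1.
    """
    x1 = sum(rng.random() < eps for _ in range(x))
    return x1, x - x1


def correlation(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


if __name__ == "__main__":
    rng = random.Random(0)
    lam, eps, n = 5.0, 0.3, 20000
    xs = [sample_poisson(lam, rng) for _ in range(n)]
    parts = [thin_poisson(x, eps, rng) for x in xs]
    x1s = [p[0] for p in parts]
    x2s = [p[1] for p in parts]
    # The parts always sum back to X; their means should be near eps*lam and
    # (1-eps)*lam, and their sample correlation should be near zero.
    print("mean of X1:", sum(x1s) / n)
    print("mean of X2:", sum(x2s) / n)
    print("corr(X1, X2):", correlation(x1s, x2s))
```

A near-zero sample correlation is only a sanity check, not a proof of independence. In a typical use, one part would be used to fit or select a model and the other to validate it.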