Suppose we observe a random vector $X$ from some distribution $P$ in a known family with unknown parameters. We ask the following question: when is it possible to split $X$ into two parts $f(X)$ and $g(X)$ such that neither part is sufficient to reconstruct $X$ by itself, but both together can recover $X$ fully, and the joint distribution of $(f(X),g(X))$ is tractable? As one example, if $X=(X_1,\dots,X_n)$ and $P$ is a product distribution, then for any $m<n$, we can split the sample to define $f(X)=(X_1,\dots,X_m)$ and $g(X)=(X_{m+1},\dots,X_n)$. Rasines and Young (2022) offer an alternative route to this goal: randomizing $X$ with additive Gaussian noise, which enables post-selection inference in finite samples for Gaussian distributed data and asymptotically for non-Gaussian additive models. In this paper, we offer a more general methodology for achieving such a split in finite samples, borrowing ideas from Bayesian inference to yield a (frequentist) solution that can be viewed as a continuous analog of data splitting. We call our method data fission, as an alternative to data splitting, data carving and p-value masking. We illustrate the method on a few prototypical applications, such as post-selection inference for trend filtering and other regression problems.
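To make the splitting question concrete, the following is a minimal sketch of one well-known Gaussian instance of such a decomposition, not necessarily the exact construction used in the paper. It assumes $X_i \sim N(\mu_i, \sigma^2)$ with $\sigma$ known, draws independent noise $Z_i \sim N(0, \sigma^2)$, and sets $f(X) = X + \tau Z$ and $g(X) = X - Z/\tau$. Under these assumptions $f(X)$ and $g(X)$ are jointly Gaussian with zero covariance, hence independent, and $X = (f(X) + \tau^2 g(X))/(1+\tau^2)$. The function names `gaussian_fission` and `recombine` and the tuning parameter `tau` are hypothetical and chosen for illustration only.

```python
import numpy as np

def gaussian_fission(x, sigma, tau=1.0, rng=None):
    """Split Gaussian data x into two noisy parts (f, g).

    Assumes x_i ~ N(mu_i, sigma^2) with sigma known (an assumption of this sketch).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    z = rng.normal(0.0, sigma, size=x.shape)  # independent noise, same variance as x
    f = x + tau * z   # part that could be used for selection
    g = x - z / tau   # part that could be used for inference
    return f, g

def recombine(f, g, tau=1.0):
    """Recover x exactly from both parts: f + tau^2 * g = (1 + tau^2) * x."""
    return (f + tau**2 * g) / (1.0 + tau**2)

# Illustrative data; the mean and scale here are arbitrary choices.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=100_000)
f, g = gaussian_fission(x, sigma=1.5, tau=0.5, rng=rng)
x_rec = recombine(f, g, tau=0.5)
```

In this sketch, a larger `tau` puts more noise into `f` and leaves more information in `g`, loosely mirroring the choice of $m$ in ordinary sample splitting.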