We propose data thinning, an approach for splitting an observation into two or more independent parts that sum to the original observation, and that follow the same distribution as the original observation, up to a (known) scaling of a parameter. This very general proposal is applicable to any convolution-closed distribution, a class that includes the Gaussian, Poisson, negative binomial, gamma, and binomial distributions, among others. Data thinning has a number of applications to model selection, evaluation, and inference. For instance, cross-validation via data thinning provides an attractive alternative to the usual approach of cross-validation via sample splitting, especially in unsupervised settings in which the latter is not applicable. In simulations and in an application to single-cell RNA-sequencing data, we show that data thinning can be used to validate the results of unsupervised learning approaches, such as k-means clustering and principal components analysis.
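As an illustration of the splitting idea in the Poisson case, the sketch below uses the classical binomial thinning construction: if X ~ Poisson(λ) and X1 | X ~ Binomial(X, ε), then X1 and X2 = X − X1 are independent, with X1 ~ Poisson(ελ) and X2 ~ Poisson((1 − ε)λ). The function name `thin_poisson` and the parameter names `eps` and `lam` are assumptions made for this example and do not come from the paper or any accompanying software; other convolution-closed families would need their own conditional distributions.

```python
import numpy as np


def thin_poisson(x, eps, rng):
    """Split Poisson counts into two parts that sum to the original counts.

    Hypothetical helper for illustration only. Each unit of count in `x`
    is assigned to the first part independently with probability `eps`.
    """
    x = np.asarray(x)
    x1 = rng.binomial(x, eps)  # X1 | X ~ Binomial(X, eps)
    x2 = x - x1                # the remainder, so x1 + x2 == x exactly
    return x1, x2


rng = np.random.default_rng(0)
lam, eps = 10.0, 0.3                    # illustrative values
x = rng.poisson(lam, size=100_000)      # simulated Poisson(lam) observations
x1, x2 = thin_poisson(x, eps, rng)

# Empirical checks: the parts sum to x, their means are close to
# eps * lam and (1 - eps) * lam, and their sample correlation is near zero.
mean_x1 = x1.mean()
mean_x2 = x2.mean()
corr = np.corrcoef(x1, x2)[0, 1]
```

In a cross-validation setting, one part (say `x1`) could be used to fit an unsupervised model and the other (`x2`) to evaluate it, since the two parts are independent yet share the same underlying parameter up to the known scaling.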