We propose data thinning, an approach for splitting an observation into two or more independent parts that sum to the original observation, and that follow the same distribution as the original observation, up to a (known) scaling of a parameter. This very general proposal is applicable to any convolution-closed distribution, a class that includes the Gaussian, Poisson, negative binomial, gamma, and binomial distributions, among others. Data thinning has a number of applications to model selection, evaluation, and inference. For instance, cross-validation via data thinning provides an attractive alternative to the usual approach of cross-validation via sample splitting, especially in settings in which the latter is not applicable. In simulations and in an application to single-cell RNA-sequencing data, we show that data thinning can be used to validate the results of unsupervised learning approaches, such as k-means clustering and principal components analysis, for which traditional sample splitting is unattractive or unavailable.
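As an illustration of the splitting described in the abstract, the sketch below thins Poisson counts into two parts. It relies on a standard fact rather than on code from the paper: if X ~ Poisson(λ) and X1 | X ~ Binomial(X, ε), then X1 ~ Poisson(ελ) and X2 = X − X1 ~ Poisson((1 − ε)λ), with X1 and X2 independent. The function name `thin_poisson`, the parameter name `eps`, and the simulated data are assumptions made for this example only.

```python
import numpy as np


def thin_poisson(x, eps, rng):
    """Split Poisson counts x into two parts that sum to x.

    Hypothetical helper for illustration: each unit of count is assigned
    to the first part with probability eps, independently of the others.
    """
    x = np.asarray(x)
    x1 = rng.binomial(x, eps)  # X1 | X ~ Binomial(X, eps)
    x2 = x - x1                # remainder, so x1 + x2 == x exactly
    return x1, x2


rng = np.random.default_rng(0)
lam = 5.0                                  # assumed true Poisson mean
x = rng.poisson(lam, size=100_000)         # simulated observations
x_train, x_test = thin_poisson(x, eps=0.5, rng=rng)

# Empirical checks: the parts sum to the original, each has mean close to
# eps * lam, and their sample correlation is close to zero.
print(x_train.mean(), x_test.mean())
print(np.corrcoef(x_train, x_test)[0, 1])
```

In a cross-validation setting, one would fit a model (for example, k-means) on `x_train` and evaluate it on `x_test`, rescaling parameters by the known factor `eps` or `1 - eps` as appropriate.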