This article develops new tools and new statistical theory for a statistical problem we call Scale Reliant Inference (SRI). Many scientific fields collect multivariate data that lack scale: the size, sum, or total of each measurement is arbitrary and does not represent the scale of the underlying system being measured. For example, in the analysis of high-throughput sequencing data, it is well known that the number of sequencing reads (the sequencing depth) varies substantially due to non-biological (technical) factors. This article develops a formal problem statement for SRI that unifies problems seen in multiple scientific fields. Informally, we define SRI as an estimation problem in which an estimand of interest cannot be uniquely identified because the observed data lack scale information. This problem statement reformulates the related field of Compositional Data Analysis and allows us to prove fundamental limits on SRI. For example, we prove that inferential criteria such as consistency, calibration, and unbiasedness are unattainable for common SRI tasks. Moreover, we show that common methods often applied to SRI implicitly assume infinite knowledge of the system scale and can lead to a troubling phenomenon termed unacknowledged bias. Counter-intuitively, we show that this problem worsens with more data and can lead to substantially elevated Type-I and Type-II error rates. Still, we show that rigorous statistical inference is possible so long as models acknowledge the fundamental uncertainty in the system scale. We introduce a class of models we call Scale Simulation Random Variables (SSRVs) as a flexible, rigorous, and computationally efficient approach to SRI.
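The informal definition above can be made concrete with a toy calculation. The sketch below is an illustration only, not the article's formal construction: the abundance values and the helper names `composition` and `log2_fold_change` are hypothetical. It shows two treated systems that share the same composition (and so would look alike in data that lack scale) yet imply log fold changes of opposite sign relative to the control.

```python
import math


def composition(abundances):
    """Return each entity's proportion of the total (the scale is divided out)."""
    total = sum(abundances)
    return [a / total for a in abundances]


def log2_fold_change(before, after):
    """Per-entity log2 fold change in absolute abundance."""
    return [math.log2(b / a) for a, b in zip(before, after)]


# Hypothetical absolute abundances of two entities in a control condition.
control = [100.0, 100.0]

# Two hypothetical treated systems: same composition, different total scale.
treated_large = [200.0, 400.0]  # total of 600
treated_small = [50.0, 100.0]   # total of 150

# Scale-free data can only inform the composition, which is identical here.
same_composition = all(
    math.isclose(p, q)
    for p, q in zip(composition(treated_large), composition(treated_small))
)

# The scale-reliant estimand (log fold change) differs between the two systems.
lfc_large = log2_fold_change(control, treated_large)
lfc_small = log2_fold_change(control, treated_small)

print("same composition:", same_composition)
print("log2 fold change, large scale:", lfc_large)
print("log2 fold change, small scale:", lfc_small)
```

In this toy setting the first entity appears to double under one scale and to halve under the other, so the estimand cannot be pinned down from composition alone without an assumption or a measurement of the scale.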