The most common approach to implementing data analysis pipelines is to obtain point estimates from the upstream modules and then treat them as known quantities in the downstream ones. This approach is straightforward, but it is likely to underestimate the overall uncertainty in the final estimates. An alternative is to estimate the parameters of all modules jointly using a Bayesian hierarchical model, which has the advantage of propagating upstream uncertainty into the downstream estimates. However, when modules are misspecified, such a joint model can behave in unexpected ways. Furthermore, hierarchical models require ad hoc computational implementations that can be laborious to develop and computationally expensive to run. Cut inference, which modifies the posterior distribution to prevent information flow between certain parameters, provides a third alternative for statistical inference in data analysis pipelines. This paper presents a unified framework that encompasses two-step, cut, and joint inference for data analysis pipelines with two modules, and uses two examples to illustrate the tradeoffs among these approaches. Our work shows that cut inference provides both some level of robustness and ease of implementation for data analysis pipelines at a lower cost in terms of statistical inference.
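To make the contrast between two-step and cut inference concrete, the following is a minimal sketch on a toy two-module Gaussian model. The model is an assumption made purely for illustration and is not one of the paper's examples: the upstream module observes z_i ~ N(phi, 1), the downstream module observes y_j ~ N(phi + theta, 1), and both parameters have flat priors. All function names are hypothetical. Two-step inference plugs in the upstream point estimate of phi, whereas cut inference draws phi from its upstream-only posterior and then draws theta conditionally on each draw, so upstream uncertainty reaches theta but the downstream data never update phi. Joint inference is not shown.

```python
import random
import statistics


def simulate_data(n1, n2, phi, theta, rng):
    """Toy two-module data (illustrative assumption, not from the paper)."""
    z = [rng.gauss(phi, 1.0) for _ in range(n1)]          # upstream module
    y = [rng.gauss(phi + theta, 1.0) for _ in range(n2)]  # downstream module
    return z, y


def two_step_draws(z, y, n_draws, rng):
    """Plug in the upstream point estimate of phi and treat it as known."""
    phi_hat = statistics.fmean(z)
    y_bar = statistics.fmean(y)
    sd_theta = (1.0 / len(y)) ** 0.5  # posterior sd of theta given phi
    return [rng.gauss(y_bar - phi_hat, sd_theta) for _ in range(n_draws)]


def cut_draws(z, y, n_draws, rng):
    """Draw phi from p(phi | z) only, then theta from p(theta | y, phi)."""
    z_bar = statistics.fmean(z)
    y_bar = statistics.fmean(y)
    sd_phi = (1.0 / len(z)) ** 0.5    # upstream posterior sd under a flat prior
    sd_theta = (1.0 / len(y)) ** 0.5
    draws = []
    for _ in range(n_draws):
        phi = rng.gauss(z_bar, sd_phi)                  # y never informs phi
        draws.append(rng.gauss(y_bar - phi, sd_theta))  # uncertainty flows down
    return draws


rng = random.Random(0)
z, y = simulate_data(n1=20, n2=200, phi=1.0, theta=0.5, rng=rng)
two_step = two_step_draws(z, y, 20000, rng)
cut = cut_draws(z, y, 20000, rng)

print("two-step variance of theta:", statistics.variance(two_step))
print("cut variance of theta:     ", statistics.variance(cut))
```

Under these assumptions the two sets of draws are centred on the same value, but the cut draws have variance close to 1/n2 + 1/n1 rather than 1/n2, which is the understatement of uncertainty that the two-step approach is prone to.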