Designing scalable estimation algorithms is a core challenge in modern statistics. We introduce a framework based on parallel approximants that addresses this challenge, yielding estimators with provable properties that operate on the entirety of very large, distributed data sets. We first formalize the class of statistics that can be computed directly in distributed environments through independent parallelization. We then show how to use such statistics to approximate arbitrary functional operators in appropriate spaces, yielding a general estimation framework that does not require the data to reside entirely in memory. We characterize the $L^2$ approximation properties of our approach and provide fully implemented examples of sample quantile calculation and local polynomial regression in a distributed computing environment. Several extensions remain open for future work.
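To make the idea of statistics computed through independent parallelization concrete, the following is a minimal sketch and not the framework's actual construction, which this abstract does not specify. It assumes a simple stand-in: each data chunk is reduced to an additive summary (count, sum, and histogram counts on a fixed bin grid shared by all workers), the summaries are merged by addition, and a sample quantile is approximated from the merged histogram by linear interpolation within a bin. The names `ChunkSummary`, `summarize`, `merge`, `combine`, and `approx_quantile` are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable, List, Sequence


@dataclass
class ChunkSummary:
    """Additive summary of one data chunk (hypothetical illustration)."""
    n: int             # number of observations in the chunk
    total: float       # sum of the observations
    counts: List[int]  # histogram counts on a bin grid shared by all chunks


def summarize(chunk: Sequence[float], lo: float, hi: float, bins: int) -> ChunkSummary:
    """Reduce one chunk to its summary; needs only this chunk in memory."""
    width = (hi - lo) / bins
    counts = [0] * bins
    total = 0.0
    for x in chunk:
        # Values outside [lo, hi) are clamped into the first or last bin.
        idx = min(max(int((x - lo) / width), 0), bins - 1)
        counts[idx] += 1
        total += x
    return ChunkSummary(len(chunk), total, counts)


def merge(a: ChunkSummary, b: ChunkSummary) -> ChunkSummary:
    """Combine two summaries by addition; associative and commutative."""
    return ChunkSummary(
        a.n + b.n,
        a.total + b.total,
        [p + q for p, q in zip(a.counts, b.counts)],
    )


def combine(summaries: Iterable[ChunkSummary]) -> ChunkSummary:
    """Fold any number of per-chunk summaries into one."""
    return reduce(merge, summaries)


def approx_quantile(s: ChunkSummary, q: float, lo: float, hi: float) -> float:
    """Approximate the q-th quantile from merged histogram counts."""
    width = (hi - lo) / len(s.counts)
    target = q * s.n
    cum = 0
    for i, c in enumerate(s.counts):
        if c > 0 and cum + c >= target:
            # Interpolate linearly inside the bin that crosses the target.
            frac = (target - cum) / c
            return lo + (i + frac) * width
        cum += c
    return hi
```

Because `merge` is plain addition, the per-chunk summaries can be computed on separate workers in any order and combined afterwards, so no worker ever holds the full data set. In this sketch the accuracy of the quantile is limited by the bin width, an approximation error that is specific to the histogram stand-in and says nothing about the $L^2$ guarantees of the framework itself.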