Large-scale rare events data are commonly encountered in practice. To handle massive rare events data, we propose a novel distributed estimation method for logistic regression. A distributed framework poses two challenges. The first is how to distribute the data across the system. In this regard, we investigate two distribution strategies, the RANDOM strategy and the COPY strategy. The second is how to choose the type of objective function so that the best asymptotic efficiency is achieved. To this end, we consider the under-sampled (US) and inverse probability weighted (IPW) types of objective functions. Our results suggest that the COPY strategy combined with the IPW objective function is the best solution for distributed logistic regression with rare events. The finite sample performance of the distributed methods is demonstrated through simulation studies and a real-world Sweden Traffic Sign dataset.
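To make the COPY strategy and the IPW objective concrete, the following is a minimal sketch and not the exact procedure studied here. It assumes that COPY replicates every positive instance on each worker and splits the negative instances evenly at random, that each worker reweights its negatives by the inverse of their inclusion probability (here the number of workers), and that the final estimator is the simple average of the local estimators. The function names, the Newton-Raphson solver, the simulated data, and the true coefficients are all illustrative assumptions.

```python
import numpy as np

def fit_weighted_logistic(X, y, w, n_iter=50, tol=1e-10):
    """Newton-Raphson for a weighted logistic log-likelihood.

    Assumes the first column of X is an intercept column of ones.
    """
    beta = np.zeros(X.shape[1])
    # Start the intercept at the weighted log-odds so that Newton steps
    # stay well-behaved when positives are rare.
    ybar = np.sum(w * y) / np.sum(w)
    beta[0] = np.log(ybar / (1.0 - ybar))
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (w * (y - p))                      # weighted score
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X  # weighted information
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def copy_ipw_estimate(X, y, n_workers, rng):
    """Assumed COPY + IPW scheme: positives on every worker, negatives split."""
    pos = np.flatnonzero(y == 1)
    neg = rng.permutation(np.flatnonzero(y == 0))
    local_estimates = []
    for chunk in np.array_split(neg, n_workers):
        idx = np.concatenate([pos, chunk])
        # Each negative lands on a given worker with probability 1/n_workers,
        # so its inverse probability weight is n_workers; positives get 1.
        w = np.where(y[idx] == 1, 1.0, float(n_workers))
        local_estimates.append(fit_weighted_logistic(X[idx], y[idx], w))
    # Combine by simple averaging (an assumption of this sketch).
    return np.mean(local_estimates, axis=0)

# Simulated rare events data (hypothetical settings for illustration only).
rng = np.random.default_rng(0)
n_total = 200_000
beta_true = np.array([-6.0, 1.0, -1.0])
X = np.column_stack([np.ones(n_total), rng.standard_normal((n_total, 2))])
prob = 1.0 / (1.0 + np.exp(-(X @ beta_true)))
y = rng.binomial(1, prob)

beta_hat = copy_ipw_estimate(X, y, n_workers=4, rng=rng)
print("positives:", int(y.sum()), "estimate:", beta_hat)
```

Under these assumptions, every worker sees all of the scarce positive instances, and the weights let each local objective approximate the full-data log-likelihood, so the intercept needs no separate correction. An under-sampled (US) variant would instead fit the unweighted objective on the same local data.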