Distributed Logistic Regression for Massive Data with Rare Events

Large-scale rare events data are commonly encountered in practice. To tackle the massive rare events data, we propose a novel distributed estimation method for logistic regression in a distributed system. For a distributed framework, we face the following two challenges. The first challenge is how to distribute the data. In this regard, two different distribution strategies (i.e., the RANDOM strategy and the COPY strategy) are investigated. The second challenge is how to select an appropriate type of objective function so that the best asymptotic efficiency can be achieved. Then, the under-sampled (US) and inverse probability weighted (IPW) types of objective functions are considered. Our results suggest that the COPY strategy together with the IPW objective function is the best solution for distributed logistic regression with rare events. The finite sample performance of the distributed methods is demonstrated by simulation studies and a real-world Sweden Traffic Sign dataset.

翻译：大规模稀有事件数据在实际应用中普遍存在。针对此类海量数据，我们提出一种适用于分布式系统的逻辑回归新型分布式估计方法。在分布式框架下面临两个挑战：第一是数据分配策略，对此我们研究了两种不同的数据分配策略（即随机分配策略与复制策略）；第二是如何选择最优渐近效率的目标函数类型，为此我们考虑了欠采样（US）和逆概率加权（IPW）两类目标函数。研究结果表明，采用复制策略结合逆概率加权目标函数是处理稀有事件分布式逻辑回归的最佳方案。通过模拟实验和瑞典交通标志真实数据集验证了该分布式方法的有限样本性能。

相关内容

逻辑回归

关注 318

逻辑回归（也称“对数几率回归”）（英语：Logistic regression 或logit regression），即逻辑模型（英语：Logit model，也译作“评定模型”、“分类评定模型”）是离散选择法模型之一，属于多重变量分析范畴，是社会学、生物统计学、临床、数量心理学、计量经济学、市场营销等统计实证分析的常用方法。在统计学中，logistic模型(或logit模型)用于对存在的某个类或事件的概率建模，例如通过/失败、赢/输、活着/死了或健康/生病。这可以扩展到建模若干类事件，如确定一个图像是否包含猫、狗、狮子等。图像中检测到的每个物体的概率都在0到1之间，其和为1。

【2023新书】使用Python进行统计和数据可视化，554页pdf

专知会员服务

130+阅读 · 2023年1月29日

【干货书】工程和科学中的概率和统计，

专知会员服务

58+阅读 · 2022年12月24日

不可错过！《机器学习100讲》课程，UBC Mark Schmidt讲授

专知会员服务

76+阅读 · 2022年6月28日

INRIA 最新《机器学习理论》课程笔记，176页pdf

专知会员服务

52+阅读 · 2020年12月14日