Training classifiers is difficult under severe class imbalance, but many rare events are the culmination of a sequence with much more common intermediate outcomes. For example, in online marketing a user first sees an ad, then may click on it, and finally may make a purchase; estimating the probability of a purchase is difficult because purchases are rare. We show both theoretically and through data experiments that the more abundant data from earlier steps can be leveraged to improve estimates of rare-event probabilities. We present PRESTO, a relaxation of the proportional odds model for ordinal regression. The proportional odds model estimates weights for a single separating hyperplane, which is shifted by a separate intercept for each of the estimated Bayes decision boundaries between adjacent pairs of categorical responses; PRESTO instead estimates a separate weight vector for each of these transitions. We impose an L1 penalty on the differences between weights for the same feature in adjacent weight vectors in order to shrink towards the proportional odds model. We prove that PRESTO consistently estimates the decision boundary weights under a sparsity assumption. Synthetic and real data experiments show that our method estimates rare probabilities in this setting better than both logistic regression on the rare category, which fails to borrow strength from more abundant categories, and the proportional odds model, which is too inflexible.
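To make the model structure concrete, the following is a minimal sketch of a PRESTO-style penalized loss, not the authors' implementation. It assumes a logistic link with the parameterization P(y <= k | x) = sigmoid(alpha_k + x . beta_k), one weight vector per transition, and a penalty consisting only of the L1 norm of differences between adjacent weight vectors, as the abstract describes; the function names, the sign convention, and the clipping constant `eps` are illustrative assumptions.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def category_probs(x, alphas, betas):
    """Return [P(y = 0 | x), ..., P(y = K-1 | x)] for one feature vector x.

    alphas: K-1 intercepts; betas: K-1 weight vectors, one per transition.
    Assumed parameterization: P(y <= k | x) = sigmoid(alphas[k] + x . betas[k]).
    With a shared weight vector this is the proportional odds model.
    """
    cum = [0.0]
    for a, b in zip(alphas, betas):
        cum.append(sigmoid(a + sum(xi * bi for xi, bi in zip(x, b))))
    cum.append(1.0)
    # Category probabilities are differences of adjacent cumulative probabilities.
    # With unequal weight vectors these differences are not guaranteed to be
    # nonnegative for every x; the loss below clips them before taking logs.
    return [cum[k + 1] - cum[k] for k in range(len(cum) - 1)]


def fusion_penalty(betas):
    # L1 norm of the differences between weights for the same feature
    # in adjacent weight vectors.
    return sum(
        abs(w_next - w)
        for b, b_next in zip(betas, betas[1:])
        for w, w_next in zip(b, b_next)
    )


def penalized_loss(X, y, alphas, betas, lam, eps=1e-12):
    # Average negative log-likelihood plus lam times the fusion penalty.
    nll = 0.0
    for x, label in zip(X, y):
        p = category_probs(x, alphas, betas)[label]
        nll -= math.log(max(p, eps))
    return nll / len(X) + lam * fusion_penalty(betas)
```

When every row of `betas` is identical, the penalty is zero and the loss reduces to the proportional odds likelihood; a larger `lam` pushes the fitted weight vectors toward that shared-weight case. In the marketing example, labels 0, 1, and 2 could stand for no click, click without purchase, and purchase.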