Estimating Conditional Average Treatment Effects (CATE) at the individual level is central to precision marketing, yet systematic benchmarking of uplift modeling methods at industrial scale remains limited. We present UpliftBench, an empirical evaluation of four CATE estimators: S-Learner, T-Learner, X-Learner (all with LightGBM base learners), and Causal Forest (EconML), applied to the Criteo Uplift v2.1 dataset comprising 13.98 million customer records. The near-random treatment assignment (propensity AUC = 0.509) provides strong internal validity for causal estimation. Evaluated via Qini coefficient and cumulative gain curves, the S-Learner achieves the highest Qini score of 0.376, with the top 20% of customers ranked by predicted CATE capturing 77.7% of all incremental conversions, a 3.9x improvement over random targeting. SHAP analysis identifies f8 as the dominant heterogeneous treatment effect (HTE) driver among the 12 anonymized covariates. Causal Forest uncertainty quantification reveals that 1.9% of customers are confident persuadables (lower 95% CI > 0) and 0.1% are confident sleeping dogs (upper 95% CI < 0). Our results provide practitioners with evidence-based guidance on method selection for large-scale uplift modeling pipelines.
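To make the reported targeting figures concrete, the following minimal sketch shows one common way to compute the share of incremental conversions captured by the top-ranked customers, the resulting lift over random targeting, and the confidence-interval segmentation into persuadables and sleeping dogs. It is an illustration under stated assumptions, not the authors' evaluation code: the function names (`incremental_conversions`, `capture_share`, `lift_over_random`, `segment_by_interval`) are hypothetical, and the gain estimator used here (treated conversions minus control conversions rescaled to the treated group size) is a standard choice that may differ from the exact definition used in UpliftBench.

```python
from typing import Sequence


def incremental_conversions(treated: Sequence[int], converted: Sequence[int]) -> float:
    """Estimate incremental conversions within a group.

    Assumed estimator: conversions among treated customers minus control
    conversions rescaled to the size of the treated group.
    """
    n_treated = sum(treated)
    n_control = len(treated) - n_treated
    if n_treated == 0 or n_control == 0:
        return 0.0  # no contrast available in this group
    y_treated = sum(y for t, y in zip(treated, converted) if t == 1)
    y_control = sum(y for t, y in zip(treated, converted) if t == 0)
    return y_treated - y_control * n_treated / n_control


def capture_share(
    cate: Sequence[float],
    treated: Sequence[int],
    converted: Sequence[int],
    fraction: float,
) -> float:
    """Share of all incremental conversions found in the top `fraction`
    of customers when ranked by predicted CATE (highest first)."""
    order = sorted(range(len(cate)), key=lambda i: cate[i], reverse=True)
    k = int(round(fraction * len(order)))
    top = order[:k]
    total_gain = incremental_conversions(treated, converted)
    if total_gain == 0:
        raise ValueError("total incremental conversions is zero; share undefined")
    top_gain = incremental_conversions(
        [treated[i] for i in top], [converted[i] for i in top]
    )
    return top_gain / total_gain


def lift_over_random(share: float, fraction: float) -> float:
    """Random targeting of a `fraction` of customers captures, in expectation,
    the same fraction of incremental conversions, so lift = share / fraction."""
    return share / fraction


def segment_by_interval(lower: float, upper: float) -> str:
    """Label a customer from a 95% CI on the predicted CATE."""
    if lower > 0:
        return "persuadable"    # effect confidently positive
    if upper < 0:
        return "sleeping dog"   # effect confidently negative
    return "uncertain"          # interval contains zero


# Arithmetic behind the figure in the abstract: 77.7% captured in the top 20%.
print(round(lift_over_random(0.777, 0.20), 1))  # 3.9
```

Under this reading, the 3.9x figure is simply the captured share divided by the targeted fraction (0.777 / 0.20), and the persuadable and sleeping-dog segments correspond to customers whose 95% interval lies entirely above or entirely below zero.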