Evaluating Uplift Modeling under Structural Biases: Insights into Metric Stability and Model Robustness

In personalized marketing, uplift models estimate the incremental effect of an intervention by modeling, counterfactually, how customer behavior would change under alternative treatments. However, real-world marketing data often exhibit biases such as selection bias, spillover effects, measurement error, and unobserved confounding. These biases can degrade both the accuracy of uplift estimation and the validity of evaluation metrics. Despite the importance of bias-aware assessment, systematic studies of how different models and metrics perform under such biased conditions are lacking. To bridge this gap, we design a systematic benchmarking framework built on semi-synthetic data. Unlike standard predictive tasks, real-world uplift datasets inherently lack counterfactual ground truth, which makes direct validation of evaluation metrics infeasible and prevents precise quantification of biases. A semi-synthetic approach addresses this limitation by retaining real-world feature dependencies while providing the ground truth needed to isolate structural biases. Our investigations reveal that (i) uplift targeting and uplift prediction can be distinct objectives, where proficiency in one does not ensure efficacy in the other; (ii) while many models perform inconsistently across diverse biases, TARNet shows notable robustness, providing insights for subsequent model design; and (iii) the stability of evaluation metrics is linked to their mathematical alignment with the average treatment effect (ATE), suggesting that ATE-approximating metrics yield more consistent model rankings under structural data imperfections. These findings point to the need for more robust uplift models and evaluation metrics under real-world data imperfections.
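
To make the semi-synthetic idea concrete, the following is a minimal sketch and not the benchmarking framework itself. It assumes a feature matrix `X` (here randomly generated as a stand-in for real features), a hypothetical outcome model in which both the baseline outcome and the uplift depend on the first feature, and a logistic propensity that injects selection bias. The function names and all parameter values are illustrative assumptions. Because both potential outcomes are simulated, the true ATE is known and can be compared with a naive difference in means.

```python
import numpy as np


def simulate_semi_synthetic(X, selection_strength=0.0, seed=0):
    """Simulate treatment and outcomes on top of a given feature matrix X.

    Hypothetical setup (an assumption, not taken from the paper): the
    baseline outcome and the individual uplift both depend on X[:, 0], and
    treatment assignment follows a logistic propensity in X[:, 0].
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    tau = 0.5 + 0.5 * X[:, 0]                  # true individual uplift
    y0 = X[:, 0] + rng.normal(0.0, 0.1, n)     # potential outcome, untreated
    y1 = y0 + tau                              # potential outcome, treated
    # selection_strength = 0 gives a randomized experiment (propensity 0.5);
    # larger values make treatment depend more strongly on X[:, 0].
    propensity = 1.0 / (1.0 + np.exp(-selection_strength * X[:, 0]))
    t = rng.binomial(1, propensity)
    y = np.where(t == 1, y1, y0)               # only one outcome is observed
    return t, y, tau


def naive_ate(t, y):
    """Difference in mean observed outcomes between treated and control."""
    return y[t == 1].mean() - y[t == 0].mean()


# Stand-in for real-world features; in practice X would come from real data.
X = np.random.default_rng(42).normal(size=(20000, 3))

for strength in (0.0, 2.0):
    t, y, tau = simulate_semi_synthetic(X, selection_strength=strength)
    print(f"selection_strength={strength}: "
          f"true ATE={tau.mean():.3f}, naive estimate={naive_ate(t, y):.3f}")
```

Under these assumptions, the naive estimate should track the true ATE when assignment is randomized and drift away from it once the propensity depends on a feature that also drives the outcome. Such a gap can only be measured when counterfactual ground truth is available.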
