Sequential incentive marketing is an important approach for online businesses to acquire customers, increase loyalty, and boost sales. How to allocate the incentives effectively so as to maximize the return (e.g., business objectives) under a budget constraint, however, is less studied in the literature. This problem is technically challenging because 1) the allocation strategy has to be learned from historically logged data, which is counterfactual in nature, and 2) both the optimality and the feasibility (i.e., that cost cannot exceed budget) of the strategy need to be assessed before it is deployed to online systems. In this paper, we formulate the problem as a constrained Markov decision process (CMDP). To solve the CMDP problem with logged counterfactual data, we propose an efficient learning algorithm that combines bisection search and model-based planning. First, the CMDP is converted into its dual using Lagrangian relaxation, and the dual is proved to be monotonic with respect to the dual variable. Furthermore, we show that the dual problem can be solved by policy learning, with the optimal dual variable found efficiently via bisection search (i.e., by taking advantage of the monotonicity). Lastly, we show that model-based planning can be used to effectively accelerate the joint optimization process without retraining the policy for every dual variable. Empirical results on synthetic and real marketing datasets confirm the effectiveness of our methods.
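To make the role of bisection search concrete, the following is a minimal sketch and not the authors' implementation. It assumes a one-step toy problem in place of the full CMDP: each user has a small table of per-action rewards and costs, and the "policy" for a given dual variable simply picks the action maximizing reward minus the dual variable times cost. Under that assumption the total cost is non-increasing in the dual variable, so bisection can locate (approximately) the smallest value that satisfies the budget. The function names `best_response`, `total_cost`, and `bisect_dual`, and all parameters, are hypothetical.

```python
import numpy as np


def best_response(rewards, costs, lam):
    """Per-user action maximizing the Lagrangian score reward - lam * cost.

    Assumption: this stands in for the policy-learning step; it is a
    one-step toy, not a policy learned on an MDP.
    """
    return (rewards - lam * costs).argmax(axis=1)


def total_cost(costs, actions):
    """Total cost incurred by the chosen per-user actions."""
    return float(costs[np.arange(len(actions)), actions].sum())


def bisect_dual(rewards, costs, budget, lam_hi=1.0, tol=1e-6, max_iter=200):
    """Approximate the smallest lam >= 0 whose best response fits the budget."""

    def cost_at(lam):
        return total_cost(costs, best_response(rewards, costs, lam))

    lam_lo = 0.0
    if cost_at(lam_lo) <= budget:
        return lam_lo  # the budget constraint is inactive

    # Grow the upper bracket until the best response becomes feasible.
    while cost_at(lam_hi) > budget:
        lam_hi *= 2.0
        if lam_hi > 1e12:
            raise ValueError("no feasible dual variable found for this budget")

    # Invariant: cost_at(lam_lo) > budget >= cost_at(lam_hi).
    for _ in range(max_iter):
        if lam_hi - lam_lo < tol:
            break
        mid = 0.5 * (lam_lo + lam_hi)
        if cost_at(mid) > budget:
            lam_lo = mid
        else:
            lam_hi = mid
    return lam_hi
```

The sketch returns the feasible end of the bracket, so the resulting allocation never exceeds the budget. It re-solves the inner problem for every candidate dual variable, which is cheap in this toy setting; the abstract's model-based planning step is what avoids such retraining in the actual method and is not reproduced here.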