Q-Regularized Generative Auto-Bidding: From Suboptimal Trajectories to Optimal Policies

From arXiv: due to the company's compliance requirements, we would like to wait until the paper is officially published before making it publicly available on arXiv.

With the rapid development of e-commerce, auto-bidding has become a key tool for optimizing advertising performance across diverse advertiser environments. Current approaches focus on reinforcement learning (RL) and generative models. These methods imitate offline historical behavior using complex architectures that require expensive hyperparameter tuning, and the suboptimal trajectories in the historical data further exacerbate the difficulty of policy learning. To address these challenges, we propose QGA, a novel Q-value regularized Generative Auto-bidding method. In QGA, we plug a Q-value regularization module, trained with a double Q-learning strategy, into the Decision Transformer (DT) backbone. This design enables joint optimization of policy imitation and action-value maximization, allowing the learned bidding policy both to leverage experience from the dataset and to alleviate the adverse impact of the suboptimal trajectories. Furthermore, to safely explore the policy space beyond the data distribution, we propose a Q-value guided dual-exploration mechanism, in which the DT model is conditioned on multiple return-to-go targets and locally perturbed actions. The entire exploration process is dynamically guided by the Q-value module, which provides a principled evaluation of each candidate action. Experiments on public benchmarks and simulation environments demonstrate that QGA consistently achieves superior or highly competitive results compared to existing alternatives. Notably, in large-scale real-world A/B testing, QGA achieves a 3.27% increase in Ad GMV and a 2.49% improvement in Ad ROI.
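The two mechanisms named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract gives no formulas, so several details are assumptions: the loss form (squared imitation error minus a weighted Q-value), the use of the minimum of two Q-estimates as the pessimistic score, the weight `eta`, the return-to-go scales, the multiplicative perturbations, and all function names. A bid is simplified to a single float, and the policy and Q-functions are passed in as plain callables standing in for the DT backbone and the learned critics.

```python
from typing import Callable, Sequence

State = Sequence[float]
PolicyFn = Callable[[State, float], float]  # (state, return-to-go) -> action
QFn = Callable[[State, float], float]       # (state, action) -> Q-value


def regularized_loss(pred_action: float, data_action: float,
                     q1_value: float, q2_value: float,
                     eta: float = 0.1) -> float:
    """Assumed training objective: imitate the logged action while
    rewarding actions that the twin critics rate highly."""
    imitation = (pred_action - data_action) ** 2
    # Taking the smaller of the two estimates is an assumed way to use
    # the double Q-learning critics pessimistically.
    return imitation - eta * min(q1_value, q2_value)


def select_action(state: State, base_rtg: float, policy: PolicyFn,
                  q1: QFn, q2: QFn,
                  rtg_scales: Sequence[float] = (0.8, 1.0, 1.2),
                  deltas: Sequence[float] = (-0.05, 0.0, 0.05)) -> float:
    """Assumed dual exploration: vary the return-to-go target, perturb
    each proposed action locally, and keep the best-scored candidate."""
    candidates = []
    for scale in rtg_scales:
        action = policy(state, base_rtg * scale)
        for delta in deltas:
            candidates.append(action * (1.0 + delta))
    return max(candidates, key=lambda a: min(q1(state, a), q2(state, a)))


# Toy stand-ins (hypothetical) for the DT policy and the two critics.
toy_policy = lambda state, rtg: 0.5 * rtg
toy_q1 = lambda state, action: -(action - 1.0) ** 2
toy_q2 = lambda state, action: -(action - 1.1) ** 2

best_bid = select_action((0.0,), 2.0, toy_policy, toy_q1, toy_q2)
```

In this toy setup the selected bid comes from a perturbed candidate rather than from any raw policy output, which is the point of combining both exploration axes: the Q-value score decides among candidates that the imitation policy alone would not have proposed.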
