Q-Regularized Generative Auto-Bidding: From Suboptimal Trajectories to Optimal Policies

With the rapid development of e-commerce, auto-bidding has become a key tool for optimizing advertising performance across diverse advertiser environments. Current approaches rely on reinforcement learning (RL) and generative models. They imitate offline historical behavior using complex architectures that require expensive hyperparameter tuning, and suboptimal trajectories in the data make policy learning harder still. To address these challenges, we propose QGA, a novel Q-value regularized Generative Auto-bidding method. QGA plugs a Q-value regularization term, trained with a double Q-learning strategy, into the Decision Transformer (DT) backbone. This design jointly optimizes policy imitation and action-value maximization, so the learned bidding policy can leverage the experience in the dataset while alleviating the adverse impact of suboptimal trajectories. Furthermore, to safely explore the policy space beyond the data distribution, we propose a Q-value guided dual-exploration mechanism, in which the DT model is conditioned on multiple return-to-go targets and its actions are locally perturbed. The whole exploration process is dynamically guided by the same Q-value module, which provides a principled evaluation of each candidate action. Experiments on public benchmarks and simulation environments demonstrate that QGA consistently achieves superior or highly competitive results compared to existing alternatives. Notably, in large-scale real-world A/B testing, QGA achieves a 3.27% increase in Ad GMV and a 2.49% improvement in Ad ROI.
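The two ideas in the abstract, a Q-regularized imitation objective and Q-guided dual exploration, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the abstract gives no formulas, so the loss form (mean squared imitation error minus a weighted mean Q-value), the weight `lam`, the Gaussian perturbation with scale `sigma`, taking the minimum of two critics as the conservative double-Q estimate, and all function names are assumptions made for illustration.

```python
import random
from typing import Callable, Optional, Sequence

# Hypothetical signatures: a policy maps (state, return-to-go) to a scalar bid
# action, and a critic maps (state, action) to a scalar Q-value.
Policy = Callable[[float, float], float]
Critic = Callable[[float, float], float]


def joint_loss(pred_actions: Sequence[float],
               data_actions: Sequence[float],
               q_values: Sequence[float],
               lam: float = 0.5) -> float:
    """Assumed objective: imitation error minus lam times the mean Q-value.

    Minimizing it pulls predictions toward logged actions while favoring
    actions that the critic scores highly.
    """
    n = len(pred_actions)
    imitation = sum((p - a) ** 2 for p, a in zip(pred_actions, data_actions)) / n
    q_term = sum(q_values) / n
    return imitation - lam * q_term


def conservative_q(q1: Critic, q2: Critic, state: float, action: float) -> float:
    """Assumed double-Q estimate: the smaller of two critics' values."""
    return min(q1(state, action), q2(state, action))


def select_action(policy: Policy, q1: Critic, q2: Critic, state: float,
                  rtg_targets: Sequence[float], n_perturb: int = 4,
                  sigma: float = 0.05,
                  rng: Optional[random.Random] = None) -> float:
    """Dual exploration: vary the return-to-go target, perturb each proposed
    action locally, then keep the candidate with the highest Q estimate."""
    rng = rng or random.Random(0)
    candidates = []
    for rtg in rtg_targets:
        base = policy(state, rtg)          # exploration over return-to-go targets
        candidates.append(base)
        for _ in range(n_perturb):         # local exploration around each proposal
            candidates.append(base + rng.gauss(0.0, sigma))
    return max(candidates, key=lambda a: conservative_q(q1, q2, state, a))


# Toy stand-ins for the DT policy and the two critics (purely illustrative).
toy_policy = lambda state, rtg: rtg * state
toy_q1 = lambda state, action: -(action - 1.0) ** 2
toy_q2 = lambda state, action: -(action - 1.0) ** 2 - 0.1

best = select_action(toy_policy, toy_q1, toy_q2, state=1.0,
                     rtg_targets=[0.5, 1.0, 1.5])
```

In this toy setup both critics peak at an action of 1.0, so the selection step keeps the unperturbed proposal from the middle return-to-go target. In a real system the policy would be the DT model and the critics would be learned networks; the selection logic is the only part the sketch is meant to show.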
