Adversarial Constrained Bidding via Minimax Regret Optimization with Causality-Aware Reinforcement Learning

The growth of the Internet has given rise to online advertising, which is driven by online auctions. In these repeated auctions, software agents bid on behalf of aggregated advertisers to optimize their long-term utility. To meet advertisers' diverse demands, bidding strategies optimize advertising objectives subject to different spending constraints. Existing approaches to constrained bidding typically assume that training and test conditions are independent and identically distributed (i.i.d.), which contradicts the adversarial nature of online ad markets, where different parties may hold conflicting objectives. We therefore study constrained bidding in adversarial bidding environments, assuming no knowledge of the adversarial factors. Instead of relying on the i.i.d. assumption, our insight is to align the training distribution of environments with the potential test distribution while minimizing policy regret. Based on this insight, we propose a practical Minimax Regret Optimization (MiRO) approach that alternates between a teacher, which finds adversarial environments for tutoring, and a learner, which meta-learns its policy over the given distribution of environments. In addition, we are the first to incorporate expert demonstrations for learning bidding strategies. Through a causality-aware policy design, we improve upon MiRO by distilling knowledge from the experts. Extensive experiments on both industrial and synthetic data show that our method, MiRO with Causality-aware reinforcement Learning (MiROCL), outperforms prior methods by over 30%.
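To make the teacher/learner alternation concrete, the following is a minimal tabular sketch of minimax regret optimization, not the MiRO algorithm itself. It assumes a small, fully known payoff table and swaps in simple stand-ins: the learner plays a best response to the teacher's current distribution over environments (in place of meta-learning a policy), and the teacher applies a multiplicative-weights update that shifts probability toward environments where the learner's regret is high. The names `UTILITY` and `solve_minimax_regret`, the table values, and the `n_rounds` and `step` parameters are all hypothetical.

```python
import math

# Hypothetical payoff table: UTILITY[e][p] is the utility that policy p
# obtains in environment e. Values are illustrative, not from the paper.
UTILITY = [
    [1.0, 0.6, 0.0],  # environment 0
    [0.0, 0.6, 1.0],  # environment 1
]


def solve_minimax_regret(utility, n_rounds=200, step=0.5):
    """Alternate a teacher (environment distribution) and a learner (policy).

    Returns the policy the learner chose most often, that policy's
    worst-case regret, and the teacher's final environment distribution.
    """
    n_env, n_pol = len(utility), len(utility[0])
    # Regret of policy p in environment e: gap to the best policy for e.
    regret = [[max(row) - u for u in row] for row in utility]

    dist = [1.0 / n_env] * n_env  # teacher starts from a uniform distribution
    counts = [0] * n_pol          # how often the learner picked each policy

    for _ in range(n_rounds):
        # Learner: minimize expected regret under the teacher's distribution.
        expected = [
            sum(dist[e] * regret[e][p] for e in range(n_env))
            for p in range(n_pol)
        ]
        choice = min(range(n_pol), key=expected.__getitem__)
        counts[choice] += 1

        # Teacher: multiplicative-weights step toward high-regret environments,
        # renormalized each round to keep the weights bounded.
        weights = [
            dist[e] * math.exp(step * regret[e][choice]) for e in range(n_env)
        ]
        total = sum(weights)
        dist = [w / total for w in weights]

    policy = max(range(n_pol), key=counts.__getitem__)
    worst_case = max(regret[e][policy] for e in range(n_env))
    return policy, worst_case, dist
```

Calling `solve_minimax_regret(UTILITY)` on this table selects the middle policy (index 1). It is never the best policy in either environment, but its worst-case regret of 0.4 is lower than the worst-case regret of 1.0 incurred by either specialized policy, which is the behavior a minimax regret objective is meant to favor.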
