Many real-world competitive systems require multiple decision-makers to act simultaneously under shared constraints, limited information, and repeated interaction, as in auctions, resource allocation, and security competition. We study multi-turn simultaneous bidding as a controlled testbed for such problems and propose DNQ, a solver-in-the-loop equilibrium supervision framework for training bidding agents. DNQ alternates between trajectory collection, critic-based payoff estimation, equilibrium computation, and policy imitation. At each visited state, a shared critic predicts either pairwise payoff matrices or an exact N-player payoff tensor, an external solver computes equilibrium strategies, and the agents are trained by minimizing the KL divergence between their masked policies and the solver-derived equilibrium targets. We focus on a scalable pairwise formulation that greatly reduces equilibrium-solving cost and training time compared with the exact formulation, while the shared critic amortizes payoff learning across agents and states. Experiments compare the pairwise and exact variants using critic loss, policy entropy, bidding resource usage, and training cost, showing that the pairwise method scales to larger numbers of agents, whereas the exact method becomes computationally impractical as the joint game grows. These results illustrate the trade-off between strategic fidelity and scalability in repeated competitive environments.
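As a rough illustration of one pairwise supervision step described above (payoff matrix, external solver, KL imitation against a masked policy), the sketch below uses NumPy. Everything in it is an assumption made for illustration rather than the DNQ implementation: the function names are hypothetical, the pairwise game is treated as zero-sum, fictitious play stands in for the unspecified external solver, the payoff matrix is hand-written instead of critic-predicted, and the KL direction is taken as KL(target || policy).

```python
import numpy as np


def masked_softmax(logits, mask):
    """Softmax restricted to legal actions; illegal actions get probability 0."""
    z = np.where(mask, logits, -np.inf)
    z = z - z[mask].max()          # stabilise; assumes at least one legal action
    e = np.exp(z)                  # exp(-inf) evaluates to 0
    return e / e.sum()


def fictitious_play(payoff, iters=5000):
    """Approximate equilibrium of a two-player zero-sum matrix game.

    payoff[i, j] is the row player's payoff; the column player receives
    -payoff[i, j]. This is a stand-in solver, not the one used in the paper.
    """
    n_rows, n_cols = payoff.shape
    row_counts = np.zeros(n_rows)
    col_counts = np.zeros(n_cols)
    row_counts[0] = 1.0            # arbitrary initial play
    col_counts[0] = 1.0
    for _ in range(iters):
        # Each player best-responds to the opponent's empirical action counts.
        best_row = np.argmax(payoff @ col_counts)
        best_col = np.argmin(row_counts @ payoff)
        row_counts[best_row] += 1.0
        col_counts[best_col] += 1.0
    return row_counts / row_counts.sum(), col_counts / col_counts.sum()


def kl_divergence(target, policy, eps=1e-12):
    """KL(target || policy), summed over the support of the target."""
    s = target > 0
    return float(np.sum(target[s] * (np.log(target[s]) - np.log(policy[s] + eps))))


# Hypothetical 3-action pairwise payoff matrix (rock-paper-scissors structure),
# standing in for a critic prediction at one visited state.
payoff = np.array([[0.0, -1.0, 1.0],
                   [1.0, 0.0, -1.0],
                   [-1.0, 1.0, 0.0]])

row_target, col_target = fictitious_play(payoff)

# Policy logits for the row agent with every action legal at this state.
logits = np.zeros(3)
mask = np.array([True, True, True])
policy = masked_softmax(logits, mask)

# Imitation loss for this state; a training loop would minimise this
# with respect to the policy parameters.
loss = kl_divergence(row_target, policy)
```

In an actual training loop the loss would be differentiated through the policy network, and the mask would exclude bids the agent cannot afford; both details are left out here to keep the sketch short.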