We study repeated multi-player vector-valued games in which each player observes a payoff vector in every round and evaluates outcomes through linear scalarizations of those vectors. Unlike most prior work, we treat the choice of scalarization as an online decision variable rather than a fixed modeling decision. We propose a bi-level learning framework in which an outer learner chooses a scalarization from a finite candidate class on a slow timescale, while a faster inner no-regret bandit learner selects actions using the scalar feedback induced by the chosen scalarization. Performance is measured with respect to a true weight vector, and the deployed scalarizations act as control signals that shape the induced payoff trajectory. We provide implementable algorithms based on bandit online mirror descent with stabilized importance weighting, and we derive finite-time performance guarantees in the form of sublinear regret bounds. In experiments on a vector-valued extension of a canonical game, the frequency of convergence to the preferred equilibrium rises from roughly $50\%$ under non-adaptive scalarization to about $80\%$ under the proposed method.
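The two-timescale structure described above can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes a single learner facing a stochastic environment rather than a full multi-player game, uses exponential weights (online mirror descent with a negative-entropy regularizer) at both levels, and stands in for "stabilized importance weighting" with an implicit-exploration estimate of the form loss / (p + gamma). The names `softmax_weights`, `ix_update`, `run_bilevel`, and `toy_payoff`, the step sizes, and the assumption that the outer learner is scored by the true-weight payoff averaged over an epoch are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax_weights(cum_loss, eta):
    """Exponential-weights distribution over arms from cumulative loss estimates."""
    z = -eta * (cum_loss - cum_loss.min())  # shift by the minimum for numerical stability
    w = np.exp(z)
    return w / w.sum()


def ix_update(cum_loss, probs, arm, loss, gamma):
    """Add a bounded importance-weighted loss estimate for the pulled arm (in place)."""
    cum_loss[arm] += loss / (probs[arm] + gamma)


def run_bilevel(payoff_fn, candidates, true_w, n_actions, epochs, epoch_len,
                eta_in=0.1, eta_out=0.1, gamma=0.05, seed=0):
    """Outer learner picks a scalarization per epoch; inner learner picks actions per round.

    Assumes payoff vectors lie in [0, 1]^d and every weight vector is nonnegative
    and sums to 1, so all scalarized losses stay in [0, 1].
    """
    rng = np.random.default_rng(seed)
    outer_loss = np.zeros(len(candidates))
    inner_loss = np.zeros(n_actions)  # kept across epochs (a simplification)
    total_true = 0.0
    for _ in range(epochs):
        q = softmax_weights(outer_loss, eta_out)
        k = rng.choice(len(candidates), p=q)   # slow timescale: choose scalarization
        w = candidates[k]
        epoch_true = 0.0
        for _ in range(epoch_len):
            p = softmax_weights(inner_loss, eta_in)
            a = rng.choice(n_actions, p=p)     # fast timescale: choose action
            u = payoff_fn(a, rng)              # observed payoff vector
            ix_update(inner_loss, p, a, 1.0 - float(w @ u), gamma)
            epoch_true += float(true_w @ u)
        # Assumption: outer feedback is the epoch-average payoff under the true weights.
        ix_update(outer_loss, q, k, 1.0 - epoch_true / epoch_len, gamma)
        total_true += epoch_true
    return total_true / (epochs * epoch_len)


def toy_payoff(action, rng):
    """Hypothetical two-objective environment with noisy payoffs clipped to [0, 1]."""
    means = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])
    return np.clip(means[action] + 0.05 * rng.standard_normal(2), 0.0, 1.0)


candidates = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
avg_true_payoff = run_bilevel(toy_payoff, candidates, true_w=np.array([0.2, 0.8]),
                              n_actions=3, epochs=200, epoch_len=50)
```

Here `avg_true_payoff` is the average per-round payoff under the true weights. In a multi-player version, each player's `payoff_fn` would depend on the other players' actions, which this sketch deliberately leaves out.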