Aligning large language models (LLMs) with human preferences is inherently multi-objective: different users and evaluation criteria impose heterogeneous and often conflicting requirements on model outputs. We propose CAGE (Common-Agency Games for Alignment), a training-free, game-theoretic framework for multi-objective test-time alignment. CAGE models alignment objectives as strategic principals that allocate token-level incentives to a shared LLM, inducing an equilibrium policy that captures the joint effect of competing objectives. We develop an efficient algorithm based on equilibrium problems with equilibrium constraints (EPEC) to compute this equilibrium, and establish theoretical guarantees including existence and uniqueness of the equilibrium policy, convergence and stability of the algorithm, and no-regret learning dynamics. Empirically, CAGE enables flexible and fine-grained trade-offs across objectives at inference time, consistently outperforming existing test-time alignment methods while requiring no retraining. It further supports weak-to-strong generalization, making multi-objective alignment practical in resource-constrained settings.
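To make the idea of token-level incentives concrete, the following is a minimal sketch of how several objectives might jointly shift the next-token distribution of a shared model. It is not the EPEC-based equilibrium algorithm from the paper, which the abstract does not specify. It assumes a simple additive form in which each objective contributes a fixed per-token incentive, scaled by a hand-chosen weight, to the base logits. The names `combined_policy`, `weights`, and `beta`, the two example objectives, and all numeric values are hypothetical and chosen for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combined_policy(base_logits, incentives, weights, beta=1.0):
    """Next-token distribution after adding weighted token-level incentives.

    base_logits: one float per vocabulary token, from the shared LLM.
    incentives:  incentives[i][t] is objective i's incentive for token t.
    weights:     one float per objective (hypothetical trade-off knobs).
    beta:        scale on the incentives (an assumption, not from the source).
    """
    shifted = list(base_logits)
    for w, inc in zip(weights, incentives):
        for t, value in enumerate(inc):
            shifted[t] += w * value / beta
    return softmax(shifted)


# Toy three-token vocabulary with two made-up objectives.
base_logits = [2.0, 1.0, 0.5]
helpfulness = [0.0, 1.5, 0.0]   # hypothetical: favors token 1
harmlessness = [0.0, 0.0, 2.0]  # hypothetical: favors token 2

base_policy = softmax(base_logits)
policy_a = combined_policy(base_logits, [helpfulness, harmlessness], [1.0, 0.0])
policy_b = combined_policy(base_logits, [helpfulness, harmlessness], [0.0, 1.0])
```

Changing `weights` at inference time moves probability mass toward the tokens each objective favors, with no retraining of the base model. In CAGE the incentives are chosen strategically by the principals and the resulting policy is an equilibrium, so fixed weights like these should be read only as a stand-in for that process.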