We study preference learning through recommendations in multi-agent game settings, where a moderator repeatedly interacts with agents whose utility functions are unknown. In each round, the moderator issues action recommendations and observes whether agents follow or deviate from them. We consider two canonical behavioral feedback models, best response and quantal response, and study how the information revealed by each affects the learnability of agents' utilities. We show that under quantal-response feedback the game is learnable, up to a positive affine equivalence class, with sample complexity logarithmic in the desired precision, whereas best-response feedback can identify agents' utilities only up to a larger set, for which we give a complete geometric characterization. Moreover, we introduce a regret notion based on agents' incentives to deviate from recommendations and design an online algorithm that achieves low regret under both feedback models, with bounds scaling linearly in the game dimension and logarithmically in time. Our results lay a theoretical foundation for AI recommendation systems in strategic multi-agent environments, where compliance with recommendations is shaped by strategic interaction.
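For concreteness, quantal response is standardly modeled as logit choice; the display below is an illustrative sketch in our own notation ($\lambda$, $u_i$, $A_i$, and $a_{-i}$ are not fixed by the abstract) of a common instance of the model class, not necessarily the paper's exact definition:

$$\Pr[a_i = a] \;=\; \frac{\exp\!\big(\lambda\, u_i(a, a_{-i})\big)}{\sum_{a' \in A_i} \exp\!\big(\lambda\, u_i(a', a_{-i})\big)}, \qquad \lambda > 0.$$

Under this form, replacing $u_i$ by $\alpha u_i + \beta$ with $\alpha > 0$ while rescaling $\lambda$ to $\lambda/\alpha$ leaves the choice probabilities unchanged, which is one standard reason utilities are identifiable at best up to a positive affine class; as $\lambda \to \infty$ the model recovers best response, whose argmax feedback is invariant under even more transformations and hence reveals strictly less about $u_i$.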