This work considers a repeated principal-agent bandit game in which the principal can interact with her environment only through the agent. The principal and the agent have misaligned objectives, and the choice of action is left solely to the agent. However, the principal can influence the agent's decisions by offering incentives that are added to his rewards. The principal aims to iteratively learn an incentive policy that maximizes her own total utility. This framework extends standard bandit problems and is motivated by several practical applications, such as healthcare or ecological taxation, where traditional mechanism design theories often overlook the learning aspect of the problem. We present learning algorithms whose regret for the principal is nearly optimal with respect to the horizon $T$, in both multi-armed and linear contextual settings. Finally, we support our theoretical guarantees with numerical experiments.
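To make the interaction concrete, the following is a minimal sketch of a single round of the game, not the learning algorithms of this work. It assumes that both players' mean rewards are known and deterministic, that the agent greedily picks the arm maximizing his mean reward plus the offered incentive, and that the principal's utility is her own mean reward for the chosen arm minus the incentive paid. All function names, the tie-breaking margin `eps`, and the numerical values are hypothetical and chosen only for illustration.

```python
import numpy as np


def agent_choice(agent_means, incentives):
    """Assumed agent behavior: pick the arm with the largest mean reward plus incentive."""
    return int(np.argmax(np.asarray(agent_means) + np.asarray(incentives)))


def minimal_incentive(agent_means, target, eps=1e-6):
    """Incentive vector that pays only on `target`, just enough for the agent to pick it.

    `eps` is a hypothetical small margin used to break ties in favor of `target`.
    """
    agent_means = np.asarray(agent_means, dtype=float)
    incentives = np.zeros_like(agent_means)
    # No payment is needed if the agent already picks the target arm on his own.
    if agent_choice(agent_means, incentives) != target:
        incentives[target] = np.max(agent_means) - agent_means[target] + eps
    return incentives


def best_incentive(principal_means, agent_means):
    """Enumerate target arms and keep the one with the highest principal utility."""
    principal_means = np.asarray(principal_means, dtype=float)
    best_arm, best_utility, best_vector = None, -np.inf, None
    for target in range(len(principal_means)):
        incentives = minimal_incentive(agent_means, target)
        chosen = agent_choice(agent_means, incentives)
        # Assumed principal utility: her reward on the chosen arm minus the payment.
        utility = principal_means[chosen] - incentives[chosen]
        if utility > best_utility:
            best_arm, best_utility, best_vector = chosen, float(utility), incentives
    return best_arm, best_utility, best_vector


# Hypothetical mean rewards for three arms (illustrative values only).
agent_means = np.array([0.8, 0.3, 0.5])
principal_means = np.array([0.1, 0.9, 0.6])

arm, utility, incentives = best_incentive(principal_means, agent_means)
print(arm, utility, incentives)
```

In the learning problem described above, the principal does not know these quantities in advance and has to estimate them from repeated interaction, which is where regret with respect to the horizon $T$ comes in; the sketch only shows the one-shot benchmark under the stated assumptions.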