The agency problem arises in today's large-scale machine learning tasks, where learners cannot directly oversee content creation or enforce data collection. In this work, we propose a theoretical framework for aligning the economic interests of different stakeholders in online learning problems through contract design. The problem, termed \emph{contractual reinforcement learning}, arises naturally from the classic model of Markov decision processes, in which a learning principal seeks to optimally influence the agent's action policy for their common interest through a set of payment rules contingent on the realization of the next state. For the planning problem, we design an efficient dynamic programming algorithm that determines the optimal contracts against a far-sighted agent. For the learning problem, we introduce a generic design of no-regret learning algorithms that untangles the challenge of robust contract design from the balance of exploration and exploitation, reducing the complexity analysis to the construction of efficient search algorithms. For several natural classes of problems, we design tailored search algorithms that provably achieve $\tilde{O}(\sqrt{T})$ regret. We also present an algorithm with $\tilde{O}(T^{2/3})$ regret for the general problem, which improves the existing analysis in online contract design under mild technical assumptions.
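To make the planning problem concrete, the following is a minimal sketch of backward induction for state-contingent contracts in a toy finite-horizon MDP. All numbers, the two-state/two-action instance, and the finite contract grid are illustrative assumptions for this sketch, not the paper's construction: at each state the principal offers a payment vector over next states, the far-sighted agent best-responds (counting its own continuation value, with ties broken in the principal's favor, a standard assumption), and the principal keeps the contract maximizing its net value.

```python
import itertools

# Toy finite-horizon MDP; all names and numbers are illustrative assumptions.
S = [0, 1]                    # states
A = [0, 1]                    # agent actions
H = 3                         # horizon
P = {                         # P[s][a] = distribution over next states
    0: {0: [0.9, 0.1], 1: [0.4, 0.6]},
    1: {0: [0.7, 0.3], 1: [0.2, 0.8]},
}
cost = {0: 0.0, 1: 0.3}       # agent's effort cost per action
reward = [0.0, 1.0]           # principal's reward for reaching each state

# Finite grid of candidate contracts: payment w[t] on realized next state t.
grid = [0.0, 0.25, 0.5]
contracts = list(itertools.product(grid, repeat=len(S)))

V = [0.0] * len(S)            # principal's continuation value
U = [0.0] * len(S)            # agent's continuation value
for h in range(H):            # backward induction over the horizon
    V_new, U_new = [0.0] * len(S), [0.0] * len(S)
    for s in S:
        best_pv, best_uv = -float("inf"), 0.0
        for w in contracts:
            # Far-sighted agent: values payment plus its own future value.
            def agent_val(a):
                return sum(P[s][a][t] * (w[t] + U[t]) for t in S) - cost[a]
            def principal_val(a):
                return sum(P[s][a][t] * (reward[t] - w[t] + V[t]) for t in S)
            # Agent best-responds; ties broken in the principal's favor.
            a_star = max(A, key=lambda a: (agent_val(a), principal_val(a)))
            if principal_val(a_star) > best_pv:
                best_pv, best_uv = principal_val(a_star), agent_val(a_star)
        V_new[s], U_new[s] = best_pv, best_uv
    V, U = V_new, U_new
print(V)  # principal's optimal values at the initial stage, one per state
```

Each stage costs $O(|S| \cdot |\mathcal{W}| \cdot |A| \cdot |S|)$ for a contract set $\mathcal{W}$; the discretized grid here stands in for whatever contract representation an exact planner would optimize over.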