We propose a new learning framework that captures the tiered structure of many real-world user-interaction applications, where users can be divided into two groups according to their tolerance for exploration risk and should be treated separately. In this setting, we simultaneously maintain two policies $\pi^{\text{O}}$ and $\pi^{\text{E}}$: $\pi^{\text{O}}$ ("O" for "online") interacts with the more risk-tolerant users in the first tier and minimizes regret by balancing exploration and exploitation as usual, while $\pi^{\text{E}}$ ("E" for "exploit") focuses exclusively on exploitation for the risk-averse users in the second tier, using the data collected so far. An important question is whether this separation yields advantages for the risk-averse users over the standard online setting (i.e., $\pi^{\text{E}}=\pi^{\text{O}}$). We consider the gap-independent and gap-dependent settings separately. For the former, we prove that the separation is not beneficial from a minimax perspective. For the latter, we show that if Pessimistic Value Iteration is chosen as the exploitation algorithm that produces $\pi^{\text{E}}$, we can achieve constant regret for the risk-averse users, independent of the number of episodes $K$. This is in sharp contrast to the $\Omega(\log K)$ regret of any online RL algorithm in the same setting. Meanwhile, the regret of $\pi^{\text{O}}$ (almost) retains its online optimality, so $\pi^{\text{O}}$ does not need to compromise for the success of $\pi^{\text{E}}$.
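To make the two-policy interaction concrete, the following is a minimal sketch in a simplified multi-armed bandit setting rather than the episodic RL setting of the paper. It is an illustration under stated assumptions, not the paper's algorithm: $\pi^{\text{O}}$ is stood in for by a UCB rule, $\pi^{\text{E}}$ by a lower-confidence-bound (pessimistic) rule in place of Pessimistic Value Iteration, and only the feedback from $\pi^{\text{O}}$ is assumed to be logged. The function name `run_tiered_bandit`, the Bernoulli rewards, and the bonus constant are all hypothetical choices.

```python
import math
import random


def run_tiered_bandit(means, num_rounds, seed=0):
    """Simulate a two-tier Bernoulli bandit (illustrative simplification).

    Returns the cumulative pseudo-regret of (pi_O, pi_E).
    """
    rng = random.Random(seed)
    num_arms = len(means)
    best_mean = max(means)
    counts = [0] * num_arms   # pulls per arm, from pi_O only (assumption)
    sums = [0.0] * num_arms   # summed rewards per arm, from pi_O only
    regret_online = 0.0
    regret_exploit = 0.0

    for t in range(1, num_rounds + 1):
        def bonus(arm):
            # Confidence width; infinite for arms that were never pulled.
            if counts[arm] == 0:
                return math.inf
            return math.sqrt(2.0 * math.log(t + 1) / counts[arm])

        def mean_hat(arm):
            return sums[arm] / counts[arm] if counts[arm] else 0.0

        # pi_E: pessimistic choice, maximizing the lower confidence bound
        # computed from the data collected so far.
        arm_exploit = max(range(num_arms),
                          key=lambda a: mean_hat(a) - bonus(a))
        # pi_O: optimistic choice, maximizing the upper confidence bound.
        arm_online = max(range(num_arms),
                         key=lambda a: mean_hat(a) + bonus(a))

        # Only the risk-tolerant tier generates logged feedback here.
        reward = 1.0 if rng.random() < means[arm_online] else 0.0
        counts[arm_online] += 1
        sums[arm_online] += reward

        # Pseudo-regret is measured against the true means.
        regret_online += best_mean - means[arm_online]
        regret_exploit += best_mean - means[arm_exploit]

    return regret_online, regret_exploit
```

Calling `run_tiered_bandit([0.2, 0.8], 500)` returns the two cumulative pseudo-regrets, which can be compared across different horizons to see how each tier's regret grows with the number of rounds.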