We propose a new learning framework that captures the tiered structure of many real-world user-interaction applications, where users can be divided into two groups according to their tolerance for exploration risk and should be treated separately. In this setting, we simultaneously maintain two policies $\pi^{\text{O}}$ and $\pi^{\text{E}}$: $\pi^{\text{O}}$ ("O" for "online") interacts with the more risk-tolerant users in the first tier and minimizes regret by balancing exploration and exploitation as usual, while $\pi^{\text{E}}$ ("E" for "exploit") focuses exclusively on exploitation for the risk-averse users in the second tier, using the data collected so far. An important question is whether such a separation yields advantages over the standard online setting (i.e., $\pi^{\text{E}}=\pi^{\text{O}}$) for the risk-averse users. We consider the gap-independent and gap-dependent settings separately. For the former, we prove that the separation is not beneficial from a minimax perspective. For the latter, we show that by choosing Pessimistic Value Iteration as the exploitation algorithm that produces $\pi^{\text{E}}$, we can achieve constant regret for the risk-averse users, independent of the number of episodes $K$. This is in sharp contrast to the $\Omega(\log K)$ regret of any online RL algorithm in the same setting. Meanwhile, the regret of $\pi^{\text{O}}$ (almost) retains its online optimality and need not be compromised for the success of $\pi^{\text{E}}$.
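
One way to make the two objectives concrete is to write them as a pair of regrets. The notation here is an assumption introduced for illustration and does not appear in the text above: $V_1^{\pi}(s_1)$ is the expected return of policy $\pi$ from the initial state $s_1$, $V_1^{*}(s_1)$ is the optimal value, and $\pi^{\text{O}}_k$, $\pi^{\text{E}}_k$ are the policies deployed for the two tiers in episode $k$.
\[
\mathrm{Regret}_K(\pi^{\text{O}}) = \sum_{k=1}^{K} \Bigl( V_1^{*}(s_1) - V_1^{\pi^{\text{O}}_k}(s_1) \Bigr),
\qquad
\mathrm{Regret}_K(\pi^{\text{E}}) = \sum_{k=1}^{K} \Bigl( V_1^{*}(s_1) - V_1^{\pi^{\text{E}}_k}(s_1) \Bigr).
\]
Under this reading, the gap-dependent result states that $\mathrm{Regret}_K(\pi^{\text{E}})$ can be bounded by a quantity that does not grow with $K$, whereas setting $\pi^{\text{E}}=\pi^{\text{O}}$ forces $\Omega(\log K)$ regret on the risk-averse users.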