In many interactive decision-making settings, there is latent, unobserved information that remains fixed. Consider, for example, a dialogue system in which complete information about a user, such as the user's preferences, is not given. In such an environment, the latent information remains fixed throughout each episode, since the identity of the user does not change during an interaction. This type of environment can be modeled as a Latent Markov Decision Process (LMDP), a special instance of Partially Observed Markov Decision Processes (POMDPs). Previous work established lower bounds for the LMDP class that are exponential in the number of latent contexts. This raises a question: under which natural assumptions can a near-optimal policy of an LMDP be learned efficiently? In this work, we study the class of LMDPs with {\em prospective side information}, in which an agent receives additional, weakly revealing information on the latent context at the beginning of each episode. We show that, surprisingly, this problem is not captured by contemporary settings and algorithms designed for partially observed environments. We then establish that any sample-efficient algorithm must suffer at least $\Omega(K^{2/3})$ regret, as opposed to the standard $\Omega(\sqrt{K})$ lower bounds, and design an algorithm with a matching upper bound.