We introduce Action-Triggered Sporadically Traceable Markov Decision Processes (ATST-MDPs), a reinforcement learning framework for partial observability in which full state observations occur stochastically at each step, with probability determined by the chosen action. We derive Bellman equations tailored to this setting and establish the existence of an optimal policy. Exploiting the fact that sporadic observations reveal the full state, we provide an equivalent formulation in which agents commit to action-sequences between consecutive observations. Under the linear MDP assumption, we show that the value function over such action-sequences admits a linear representation in a finite-dimensional feature map, enabling standard regression-based methods. As an application, we derive ATST-LSVI-UCB, an optimistic algorithm achieving regret $\widetilde{O}(\sqrt{Kd^3(1-\gamma)^{-3}})$ for episodic learning with geometrically distributed horizons, where $K$ is the number of episodes, $d$ the feature dimension, and $\gamma$ the discount factor (episode continuation probability), matching the known rate for linear MDPs with full observability.
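To make the observation mechanism concrete, the following is a minimal simulation sketch of one episode, not the construction used in the paper. It assumes a small tabular model, an observation probability that depends on the action alone, an observed initial state, and that an observation reveals the state reached after the action; none of these details are fixed by the abstract. The names `simulate_episode`, `P`, `obs_prob`, and `policy` are hypothetical. The episode continues after each step with probability $\gamma$, so its length is geometric with mean $1/(1-\gamma)$.

```python
import random


def simulate_episode(P, obs_prob, policy, s0, gamma, rng):
    """Roll out one episode of a toy action-triggered observation process.

    P[s][a]     : next-state probabilities (hypothetical tabular model)
    obs_prob[a] : assumed probability that the next state is revealed after action a
    policy      : maps (last observed state, actions since then) -> action
    gamma       : per-step continuation probability (geometric horizon)
    """
    state = s0
    last_obs = s0   # assumption: the initial state is observed
    pending = []    # actions taken since the last observation
    segments = []   # (observed state, action sequence committed from it)
    while True:
        # The agent only sees the last observed state and its own actions.
        action = policy(last_obs, tuple(pending))
        pending.append(action)
        weights = P[state][action]
        state = rng.choices(range(len(weights)), weights=weights)[0]
        # The chosen action determines whether the new state is revealed.
        if rng.random() < obs_prob[action]:
            segments.append((last_obs, tuple(pending)))
            last_obs, pending = state, []
        # The episode stops with probability 1 - gamma after each step.
        if rng.random() >= gamma:
            break
    if pending:
        # Actions taken after the final observation form a last open segment.
        segments.append((last_obs, tuple(pending)))
    return segments
```

Each returned pair groups an observed state with the action sequence executed until the next observation, which is the unit over which the equivalent formulation commits to action-sequences. Setting every entry of `obs_prob` to 1 yields segments of length one, which corresponds to the fully observable case.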