We study episodic linear mixture MDPs with unknown transitions and adversarial rewards under full-information feedback, using dynamic regret as the performance measure. We begin with an in-depth analysis of the strengths and limitations of the two most popular approaches: occupancy-measure-based and policy-based methods. We observe that the occupancy-measure-based method handles non-stationary environments effectively but struggles with unknown transitions. In contrast, the policy-based method handles unknown transitions effectively but struggles with non-stationary environments. Building on these observations, we propose a novel algorithm that combines the benefits of both methods. Specifically, it employs (i) an occupancy-measure-based global optimization with a two-layer structure to handle non-stationary environments; and (ii) a policy-based variance-aware value-targeted regression to handle unknown transitions. We bridge these two parts with a novel conversion. Our algorithm enjoys an $\widetilde{\mathcal{O}}(d \sqrt{H^3 K} + \sqrt{HK(H + \bar{P}_K)})$ dynamic regret, where $d$ is the feature dimension, $H$ is the episode length, $K$ is the number of episodes, and $\bar{P}_K$ is the non-stationarity measure. We show that this bound is minimax optimal up to logarithmic factors by establishing a matching lower bound. To the best of our knowledge, this is the first work to achieve near-optimal dynamic regret for adversarial linear mixture MDPs with unknown transitions without prior knowledge of the non-stationarity measure.
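As a brief reading aid for the stated bound, the following sketch works out two regimes using only the expression above. It assumes $d \ge 1$, $H \ge 1$, and $\bar{P}_K \ge 0$; the thresholds are simple algebra on the stated rate, not additional results from the paper.

$$
\begin{aligned}
\bar{P}_K = 0:&\quad \sqrt{HK(H + \bar{P}_K)} = H\sqrt{K} \le d\sqrt{H^3 K}
\;\Longrightarrow\; \widetilde{\mathcal{O}}\big(d\sqrt{H^3 K}\big), \\
\bar{P}_K \ge d^2 H^2:&\quad \sqrt{HK\,\bar{P}_K} \ge d\sqrt{H^3 K}
\;\Longrightarrow\; \widetilde{\mathcal{O}}\big(\sqrt{HK\,\bar{P}_K}\big).
\end{aligned}
$$

In the first regime the non-stationarity term is absorbed into the first term. In the second regime the non-stationarity term dominates, and for fixed $d$ and $H$ the bound stays sublinear in $K$ as long as $\bar{P}_K = o(K)$.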