A Unified Causal-Origin Taxonomy of Distributional Shifts in Reinforcement Learning

Reinforcement learning (RL) systems often degrade when operating conditions differ from those previously encountered, reflecting distributional shifts in the underlying data-generating process. Such shifts may occur between training and evaluation, as in in-distribution (ID) and out-of-distribution (OOD) generalization, or within non-stationary settings where environment dynamics evolve over time. However, the formal relationship between these two views remains unclear, and existing work focuses mainly on mitigation rather than on the causal origin of shift within the agent-environment interaction. This work develops a unified causal-origin taxonomy that characterizes the sources of distributional shift in RL and relates ID/OOD generalization to non-stationary settings. We transfer the classical dataset-shift principle from supervised learning to RL by reformulating distributional shift in terms of the generative interaction process. Using a partially observable Markov decision process (POMDP), we decompose the interaction into structural components, including the state distribution, observation process, policy, reward, and transition dynamics, together with the shifted-time boundary. The proposed taxonomy distinguishes internal (agent-driven) from external (environment-driven) distributional shifts. The shifted-time boundary perspective further characterizes shifts as explicit, implicit, or hybrid. This formulation unifies ID/OOD generalization and non-stationarity as structured changes in the underlying process. We also introduce an evaluation framework that measures shift impact and adaptation through performance degradation and recovery metrics. By grounding distributional shift in the causal-origin structure of RL, this work supports systematic analysis of robustness under distributional shift.
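
To make the decomposition concrete, the standard POMDP trajectory factorization can be written as below. This is a sketch in generic textbook notation, not necessarily the notation the paper uses: the symbols \rho_0 (initial state distribution), O (observation process), \pi (policy), R (reward), and P (transition dynamics) are assumptions chosen for illustration, and the policy is taken to be memoryless (conditioned on the current observation only) for brevity.

```latex
% Trajectory \tau = (s_0, o_0, a_0, r_0, s_1, \dots, s_T) under a POMDP
% with a memoryless policy (an assumption made here for brevity).
p(\tau) \;=\; \rho_0(s_0) \prod_{t=0}^{T-1}
    \underbrace{O(o_t \mid s_t)}_{\text{observation}}\,
    \underbrace{\pi(a_t \mid o_t)}_{\text{policy}}\,
    \underbrace{R(r_t \mid s_t, a_t)}_{\text{reward}}\,
    \underbrace{P(s_{t+1} \mid s_t, a_t)}_{\text{transition}}
```

Under this reading, a distributional shift is a change in one or more of these factors, and the agent-driven versus environment-driven distinction presumably corresponds to whether the changed factor is the policy or one of the environment-side terms.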

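
The abstract does not define its degradation and recovery metrics, so the following is only a minimal sketch of what such measurements could look like for a single known shift point. The function name `shift_metrics`, its parameters (`window`, `tolerance`), and the specific definitions (drop in windowed mean return, steps until the windowed mean returns to within a tolerance of the pre-shift baseline) are all hypothetical choices made for illustration.

```python
from statistics import mean


def shift_metrics(returns, shift_step, window=5, tolerance=0.8):
    """Hypothetical degradation/recovery metrics around one known shift.

    returns    -- per-episode returns, in chronological order
    shift_step -- index of the first episode collected after the shift
    window     -- number of episodes averaged for each estimate
    tolerance  -- fraction of the pre-shift baseline that counts as recovered
    """
    if not 0 < shift_step < len(returns):
        raise ValueError("shift_step must lie strictly inside the series")

    # Baseline: mean return over the last `window` episodes before the shift.
    baseline = mean(returns[max(0, shift_step - window):shift_step])

    # Degradation: drop in mean return over the first `window` episodes after it.
    post = mean(returns[shift_step:shift_step + window])
    degradation = baseline - post

    # Recovery time: episodes elapsed until a full post-shift window again
    # averages at least the target level (None if that never happens).
    target = baseline - (1.0 - tolerance) * abs(baseline)
    recovery_time = None
    for t in range(shift_step, len(returns) - window + 1):
        if mean(returns[t:t + window]) >= target:
            recovery_time = t - shift_step
            break

    return {
        "baseline": baseline,
        "degradation": degradation,
        "recovery_time": recovery_time,
    }


if __name__ == "__main__":
    # Synthetic curve: stable performance, a drop at episode 10, gradual recovery.
    curve = [10] * 10 + [2, 2, 4, 6, 8] + [10] * 7
    print(shift_metrics(curve, shift_step=10, window=2))
```

A windowed mean is used instead of single-episode returns so that one lucky episode does not count as recovery. A setting without a known shift point would need a change-point estimate first, which this sketch does not attempt.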