We study the sample complexity of online reinforcement learning in the general non-episodic setting of nonlinear dynamical systems with continuous state and action spaces. Our analysis accommodates a large class of dynamical systems, ranging from a finite set of nonlinear candidate models, to models with bounded and Lipschitz-continuous dynamics, to systems parametrized by a compact, real-valued parameter set. In the most general setting, our algorithm achieves a policy regret of $\mathcal{O}(N \epsilon^2 + d_\mathrm{u} \ln(m(\epsilon))/\epsilon^2)$, where $N$ is the time horizon, $\epsilon$ is a user-specified discretization width, $d_\mathrm{u}$ is the input dimension, and $m(\epsilon)$ measures the complexity of the function class under consideration via its packing number. In the special case where the dynamics are parametrized by a compact, real-valued parameter set (e.g., neural networks or transformers), we prove a policy regret of $\mathcal{O}(\sqrt{d_\mathrm{u} N p})$, where $p$ denotes the number of parameters, recovering earlier sample-complexity results derived for linear time-invariant dynamical systems. While this article focuses on characterizing sample complexity, the proposed algorithms are likely to be useful in practice, due to their simplicity, their ability to incorporate prior knowledge, and their benign transient behavior.
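To illustrate how the parametric bound relates to the general one, the following back-of-the-envelope calculation balances the two terms of the general regret bound over $\epsilon$. This is a heuristic sketch, not the paper's proof, and it assumes the standard packing-number scaling $\ln m(\epsilon) \approx p \ln(1/\epsilon)$ for a compact $p$-dimensional parameter set:
\[
N \epsilon^2 \;\approx\; \frac{d_\mathrm{u}\, p \ln(1/\epsilon)}{\epsilon^2}
\quad\Longrightarrow\quad
\epsilon^\star \approx \left(\frac{d_\mathrm{u}\, p}{N}\right)^{1/4}
\quad\Longrightarrow\quad
N (\epsilon^\star)^2 \approx \sqrt{d_\mathrm{u} N p},
\]
which, up to logarithmic factors, is consistent with the stated $\mathcal{O}(\sqrt{d_\mathrm{u} N p})$ bound.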