Representation learning is a powerful tool that enables learning across many agents or domains by enforcing that all agents operate on a shared set of learned features. However, many robotics and controls applications that would benefit from collaboration operate in settings with changing environments and goals, whereas most guarantees for representation learning are stated for static settings. Toward rigorously establishing the benefit of representation learning in dynamic settings, we analyze the regret of multi-task representation learning for linear-quadratic regulation (LQR). This setting introduces unique challenges. First, we must account for and balance the $\textit{misspecification}$ introduced by an approximate representation. Second, we cannot rely on the parameter update schemes of single-task online LQR, for which least-squares often suffices, and must devise a novel scheme that ensures sufficient improvement. We demonstrate that in settings where exploration is "benign", the regret of any agent after $T$ timesteps scales as $\tilde{\mathcal O}(\sqrt{T/H})$, where $H$ is the number of agents. In settings with "difficult" exploration, the regret scales as $\tilde{\mathcal O}(\sqrt{d_u d_\theta}\sqrt{T} + T^{3/4}/H^{1/5})$, where $d_u$ is the input dimension and $d_\theta$ is the task-specific parameter count. In both cases, comparing to the minimax single-task regret $\tilde{\mathcal O}(\sqrt{d_x d_u^2}\sqrt{T})$, where $d_x$ is the state-space dimension, we see a benefit from a large number of agents. Notably, in the difficult-exploration case, sharing a representation across tasks means the effective task-specific parameter count can often be made small, i.e., $d_\theta < d_x d_u$. Lastly, we provide numerical validation of the trends we predict.
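To make the stated scalings concrete, the following minimal Python sketch (not from the paper) evaluates the dominant terms of the three bounds above, dropping all constants and logarithmic factors. The dimensions $d_x$, $d_u$, $d_\theta$, the horizon $T$, and the agent counts $H$ are illustrative assumptions chosen so that $d_\theta < d_x d_u$.

```python
import numpy as np

# Illustrative dimensions (assumptions for this sketch, not from the paper):
# the task-specific parameter count d_theta is below d_x * d_u.
d_x, d_u, d_theta = 10, 3, 5
T = 1e6  # horizon length (illustrative)

# Dominant terms of the bounds above, ignoring constants and log factors.
single_task = np.sqrt(d_x * d_u**2) * np.sqrt(T)   # minimax single-task rate

for H in (1e1, 1e3, 1e5):                          # number of collaborating agents
    benign = np.sqrt(T / H)                         # benign-exploration multi-task
    hard = np.sqrt(d_u * d_theta) * np.sqrt(T) + T**0.75 / H**0.2  # difficult case
    print(f"H={H:.0e}: benign={benign:.3g}, hard={hard:.3g}, "
          f"single-task={single_task:.3g}")
```

With these (assumed) numbers, the difficult-exploration multi-task bound drops below the single-task minimax rate once $H$ is large: the $\sqrt{T}$ coefficient $\sqrt{d_u d_\theta}$ undercuts $\sqrt{d_x d_u^2}$ because $d_\theta < d_x d_u$, and the $T^{3/4}$ term is attenuated by $H^{1/5}$.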