How to improve scene representation is a key issue in vision-oriented decision-making applications, and current approaches usually learn task-relevant state representations within visual reinforcement learning to address this problem. While prior works typically introduce one-step behavioral similarity metrics over elements such as rewards and actions to extract task-relevant state information from observations, they often ignore the inherent dynamics relationships among these elements that are essential for learning accurate representations, which further impedes discriminating short-term similar task/behavior information within long-term dynamics transitions. To alleviate this problem, we propose DSR, an intrinsic dynamics-driven representation learning method with sequence models for visual reinforcement learning. Concretely, DSR optimizes a parameterized encoder with the state-transition dynamics of the underlying system, which encourages the latent encodings to satisfy the state-transition process so that the state space and the noise space can be distinguished. To further improve DSR's ability to encode similar tasks, the implementation sequentially models the inherent dynamics using the frequency domain of sequential elements and multi-step prediction. Experimental results show that DSR achieves significant performance improvements on the visual Distracting DMControl tasks, with an average gain of 78.9\% over the backbone baseline. Further results indicate that it also achieves the best performance on realistic autonomous driving tasks in the CARLA simulator. Moreover, qualitative analyses validate that our method learns generalizable scene representations for visual tasks. The source code is available at https://github.com/DMU-XMU/DSR.
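The two ingredients named above, multi-step prediction of latent dynamics and a frequency-domain view of sequential elements, can be illustrated with a minimal toy sketch. This is only an assumed, simplified illustration of the general idea (a linear encoder and a linear latent transition model), not the paper's actual DSR architecture; all names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions; the real method uses deep networks on images.
obs_dim, latent_dim, horizon = 8, 4, 3

W_enc = rng.normal(size=(latent_dim, obs_dim)) * 0.1    # encoder phi(o) = W_enc @ o
W_dyn = rng.normal(size=(latent_dim, latent_dim)) * 0.1  # latent transition f(z) = W_dyn @ z

def encode(obs):
    """Toy stand-in for the parameterized encoder."""
    return W_enc @ obs

def multi_step_prediction_loss(observations):
    """Mean squared error between K-step latent rollouts and encoded future observations.

    Rolling the latent forward through the transition model and matching the
    encodings of true future observations ties the representation to the
    state-transition dynamics, which is the core idea sketched here.
    """
    z = encode(observations[0])
    loss = 0.0
    for k in range(1, horizon + 1):
        z = W_dyn @ z                     # predict the next latent state
        target = encode(observations[k])  # encoding of the observed future
        loss += np.mean((z - target) ** 2)
    return loss / horizon

def frequency_features(element_seq):
    """Magnitude spectrum of a sequence of scalar elements (e.g., rewards)."""
    return np.abs(np.fft.rfft(element_seq))

obs_seq = rng.normal(size=(horizon + 1, obs_dim))
rewards = rng.normal(size=16)
print(multi_step_prediction_loss(obs_seq) >= 0.0)  # True: squared error is non-negative
print(frequency_features(rewards).shape)           # (9,): rfft of a length-16 sequence
```

In a full implementation the loss above would be backpropagated through the encoder and transition model jointly, so that latent codes which fail to follow the dynamics (i.e., noise) are suppressed.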