We present a representation-driven framework for reinforcement learning. By representing policies as estimates of their expected values, we can leverage techniques from contextual bandits to guide exploration and exploitation. In particular, embedding a policy network into a linear feature space lets us reframe the exploration-exploitation problem as a representation-exploitation problem: good policy representations enable optimal exploration. We demonstrate the framework's effectiveness by applying it to evolutionary and policy-gradient methods, yielding significantly improved performance over traditional approaches. Our framework offers a new perspective on reinforcement learning, highlighting the role of policy representation in determining optimal exploration-exploitation strategies.
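The bandit-guided exploration the abstract describes can be illustrated with a minimal LinUCB-style sketch. This is not the paper's actual algorithm: the fixed feature vectors standing in for learned policy embeddings, the ridge-regression statistics, and the toy reward model are all illustrative assumptions.

```python
import numpy as np

def linucb_select(features, A, b, alpha=1.0):
    """Pick the candidate policy whose upper confidence bound on
    expected value is highest, given ridge statistics A and b."""
    theta = np.linalg.solve(A, b)          # point estimate of value weights
    A_inv = np.linalg.inv(A)
    # per-row quadratic form x^T A^{-1} x gives the confidence width
    widths = np.sqrt(np.einsum("ij,jk,ik->i", features, A_inv, features))
    return int(np.argmax(features @ theta + alpha * widths))

def linucb_update(A, b, x, reward):
    """Rank-one update after observing the chosen policy's return."""
    A += np.outer(x, x)
    b += reward * x
    return A, b

# Toy run: three hypothetical policy embeddings in a 2-d feature space.
rng = np.random.default_rng(0)
d = 2
A, b = np.eye(d), np.zeros(d)               # ridge prior
true_theta = np.array([1.0, -0.5])          # unknown value weights
feats = np.array([[1.0, 0.0],               # expected values: 1.0,
                  [0.0, 1.0],               #                  -0.5,
                  [0.6, 0.8]])              #                  0.2
for _ in range(200):
    i = linucb_select(feats, A, b)
    r = feats[i] @ true_theta + 0.1 * rng.normal()   # noisy return
    A, b = linucb_update(A, b, feats[i], r)
```

After a few hundred rounds the greedy choice (`alpha=0`) settles on the embedding with the highest true expected value, which is the sense in which a good linear representation makes exploration tractable.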