This paper describes $\pi2\text{vec}$, a method for representing the behavior of black-box policies as feature vectors. The policy representations capture, in a task-agnostic way, how the statistics of foundation model features change in response to the policy's behavior, and they can be trained from offline data, which allows them to be used for offline policy selection. This work provides a key piece of a recipe for fusing three modern lines of research: offline policy evaluation as a counterpart to offline RL, foundation models as generic and powerful state representations, and efficient policy selection in resource-constrained environments.
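To make the idea of a policy feature vector concrete, here is a minimal sketch, not the paper's implementation. It assumes each trajectory has already been encoded into per-step feature vectors by some foundation model (the toy lists below stand in for those features), and it summarizes a policy by the average discounted sum of those features. The function names `discounted_feature_sum` and `policy_embedding` and the discount `gamma` are hypothetical choices for illustration; the method summarized above trains its representation from offline data, which this simple trajectory average does not reproduce.

```python
from typing import List, Sequence

Vector = List[float]


def discounted_feature_sum(features: Sequence[Sequence[float]], gamma: float) -> Vector:
    """Return sum_t gamma**t * features[t] for a single trajectory."""
    dim = len(features[0])
    total = [0.0] * dim
    weight = 1.0  # gamma**t, starting at t = 0
    for phi in features:
        for i in range(dim):
            total[i] += weight * phi[i]
        weight *= gamma
    return total


def policy_embedding(trajectories: Sequence[Sequence[Sequence[float]]],
                     gamma: float = 0.9) -> Vector:
    """Average the discounted feature sums over all trajectories of one policy."""
    sums = [discounted_feature_sum(traj, gamma) for traj in trajectories]
    dim = len(sums[0])
    return [sum(s[i] for s in sums) / len(sums) for i in range(dim)]


# Toy stand-in for foundation-model features: two trajectories, 2-D features.
toy_trajectories = [
    [[1.0, 0.0], [0.0, 2.0]],  # trajectory with two steps
    [[3.0, 0.0]],              # trajectory with one step
]
embedding = policy_embedding(toy_trajectories, gamma=0.5)
print(embedding)  # [2.0, 0.5]
```

Vectors of this kind can then be compared across policies, or passed to a regressor that predicts performance, without reference to any task-specific reward.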