In the context of neuroevolution, Quality-Diversity algorithms have proven effective at generating repertoires of diverse and efficient policies by relying on the definition of a behavior space. A natural goal induced by such a repertoire is to achieve behaviors on demand, which can be done by running the corresponding policy from the repertoire. However, in uncertain environments, two problems arise. First, policies can lack robustness and repeatability, meaning that multiple episodes under slightly different conditions often result in very different behaviors. Second, due to the discrete nature of the repertoire, solutions vary discontinuously. Here we present a new approach to behavior-conditioned trajectory generation based on two mechanisms. The first is MAP-Elites Low-Spread (ME-LS), which constrains the selection of solutions to those that are the most consistent in the behavior space. The second is the Quality-Diversity Transformer (QDT), a Transformer-based model conditioned on continuous behavior descriptors, which is trained on a dataset generated by policies from an ME-LS repertoire and learns to autoregressively generate sequences of actions that achieve target behaviors. Results show that ME-LS produces consistent and robust policies, and that its combination with the QDT yields a single policy capable of achieving diverse behaviors on demand with high accuracy.
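To make the idea of consistency-constrained selection concrete, the following is a minimal sketch of a MAP-Elites-style insertion step that takes behavioral spread into account. It is an illustration under stated assumptions, not the ME-LS procedure itself: the abstract does not give the exact criterion, so the spread measure (mean distance of per-episode descriptors to their centroid), the replacement rule (lower or equal spread and at least equal mean fitness), the unit-range grid, and all names (`Elite`, `spread`, `cell_of`, `try_insert`) are hypothetical choices.

```python
import math
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Sequence, Tuple

Descriptor = Tuple[float, ...]
Cell = Tuple[int, ...]


@dataclass
class Elite:
    params: object   # policy parameters (opaque in this sketch)
    fitness: float   # mean return over the evaluation episodes
    spread: float    # mean distance of episode descriptors to their centroid


def centroid(descriptors: Sequence[Descriptor]) -> Descriptor:
    """Component-wise mean of the per-episode behavior descriptors."""
    return tuple(mean(axis) for axis in zip(*descriptors))


def spread(descriptors: Sequence[Descriptor]) -> float:
    """Assumed consistency measure: lower means more repeatable behavior."""
    c = centroid(descriptors)
    return mean([math.dist(d, c) for d in descriptors])


def cell_of(descriptor: Descriptor, bins: int,
            low: float = 0.0, high: float = 1.0) -> Cell:
    """Map a descriptor to a grid cell, assuming every axis lies in [low, high]."""
    index = []
    for x in descriptor:
        ratio = (x - low) / (high - low)
        index.append(min(bins - 1, max(0, int(ratio * bins))))
    return tuple(index)


def try_insert(repertoire: Dict[Cell, Elite], params: object,
               returns: Sequence[float], descriptors: Sequence[Descriptor],
               bins: int = 10) -> bool:
    """Insert a candidate evaluated over several episodes.

    Assumed rule: an empty cell is always filled; an occupied cell is
    overwritten only if the candidate is at least as consistent AND at
    least as fit as the incumbent.
    """
    candidate = Elite(params, mean(returns), spread(descriptors))
    key = cell_of(centroid(descriptors), bins)
    incumbent = repertoire.get(key)
    if incumbent is None or (candidate.spread <= incumbent.spread
                             and candidate.fitness >= incumbent.fitness):
        repertoire[key] = candidate
        return True
    return False
```

Under this assumed rule, a candidate with higher return but a larger spread than the incumbent is rejected, which is one way to bias a repertoire toward policies whose behavior is repeatable across episodes. Other trade-offs between fitness and spread are equally plausible.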