Unambiguously identifying the rewards that drive the behaviour of entities operating in complex, open-ended, real-world environments is difficult, partly because goals and their associated behaviours emerge endogenously and are dynamically updated as environments change. Reproducing such dynamics in models would be useful in many domains, particularly where fixed reward functions limit the adaptive capabilities of agents. The simulation experiments described here assess a candidate algorithm for the dynamic updating of rewards: RULE, Reward Updating through Learning and Expectation. The approach is tested in a simplified ecosystem-like setting in which experiments challenge the entities' survival, demanding significant behavioural change. The population of entities successfully demonstrates abandonment of an initially rewarded but ultimately detrimental behaviour, amplification of a beneficial behaviour, and appropriate responses to novel items added to the environment. These adjustments occur through endogenous modification of the entities' underlying reward function, during continuous learning, without external intervention.
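The abstract does not specify how RULE modifies the reward function, so the following is only an illustrative sketch of the general idea of expectation-driven reward updating, not the paper's algorithm. All names (`AdaptiveReward`, `update`), the behaviours, and the update rule (shift each reward weight by the prediction error between outcome and a running expectation) are assumptions made for illustration.

```python
# Hypothetical sketch of expectation-driven reward updating.
# Nothing here comes from the RULE paper: the class, the behaviours,
# and the update rule are illustrative assumptions only.

class AdaptiveReward:
    def __init__(self, behaviours, init_weight=1.0, lr=0.1):
        # One reward weight and one running outcome expectation
        # per behaviour; both are adjusted endogenously.
        self.weights = {b: init_weight for b in behaviours}
        self.expectations = {b: 0.0 for b in behaviours}
        self.lr = lr

    def update(self, behaviour, outcome):
        """Shift the reward weight by the surprise (outcome minus
        expectation), then move the expectation toward the outcome."""
        surprise = outcome - self.expectations[behaviour]
        self.weights[behaviour] += self.lr * surprise
        self.expectations[behaviour] += self.lr * surprise
        return self.weights[behaviour]

# Under this toy rule, an initially rewarded behaviour whose outcomes
# turn out consistently harmful sees its reward weight decay, while a
# beneficial behaviour is amplified, without external intervention.
rewards = AdaptiveReward(["eat_red_berry", "eat_blue_berry"])
for _ in range(50):
    rewards.update("eat_red_berry", -1.0)   # consistently bad outcome
    rewards.update("eat_blue_berry", +1.0)  # consistently good outcome

assert rewards.weights["eat_red_berry"] < rewards.weights["eat_blue_berry"]
```

In this sketch the weight for each behaviour converges toward its long-run outcome (near 0 for the harmful behaviour, near 2 for the beneficial one), mirroring the abandonment and amplification effects the abstract reports, but only as a loose analogy.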