As intelligent trading agents based on reinforcement learning (RL) become more prevalent, it becomes more important to ensure that they obey laws, regulations, and human behavioral expectations. There is substantial literature on avoiding obvious catastrophes such as crashing a helicopter or bankrupting a trading account, but little on avoiding subtle non-normative behavior for which there are examples but no programmable definition. Such behavior may violate legal or regulatory, rather than physical or monetary, constraints. In this article, I consider a series of experiments in which an intelligent stock trading agent maximizes profit but may also inadvertently learn to spoof the market in which it participates. I first inject a hand-coded spoofing agent into a multi-agent market simulation and learn a recognizer for spoofing activity sequences. Then I replace the hand-coded spoofing trader with a simple profit-maximizing RL agent and observe that it independently discovers spoofing as the optimal strategy. Finally, I introduce a method to incorporate the recognizer as a normative guide, shaping the agent's perceived rewards and altering its selected actions. The agent remains profitable while avoiding spoofing behaviors that would result in even higher profit. After presenting the empirical results, I conclude with some recommendations. The method should generalize to the reduction of any unwanted behavior for which a recognizer can be learned.
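To make the idea of a recognizer acting as a normative guide concrete, the following is a minimal sketch, not the method used in the experiments. It assumes the recognizer emits a score in [0, 1] for how spoofing-like an action sequence is, and that shaping takes the simple form of subtracting a weighted penalty from profit. The names `Candidate`, `shaped_reward`, `pick_action`, and `penalty_weight`, along with all numeric values, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Candidate:
    """A candidate action with illustrative (made-up) estimates."""
    name: str
    expected_profit: float  # assumed estimate of the raw, profit-only reward
    spoof_score: float      # assumed recognizer output in [0, 1]


def shaped_reward(profit: float, spoof_score: float, penalty_weight: float) -> float:
    """Profit minus a penalty proportional to the recognizer's score.

    The linear penalty is an assumption of this sketch; other shaping
    forms are possible.
    """
    if not 0.0 <= spoof_score <= 1.0:
        raise ValueError("spoof_score must lie in [0, 1]")
    return profit - penalty_weight * spoof_score


def pick_action(candidates: Sequence[Candidate], penalty_weight: float) -> Candidate:
    """Greedily choose the candidate with the highest shaped reward."""
    return max(
        candidates,
        key=lambda c: shaped_reward(c.expected_profit, c.spoof_score, penalty_weight),
    )


# Toy numbers: the spoofing-like action has the highest raw profit.
candidates = [
    Candidate("hold", 0.0, 0.0),
    Candidate("passive_limit_order", 1.0, 0.05),
    Candidate("place_then_cancel_large_order", 3.0, 0.9),
]

unshaped_choice = pick_action(candidates, penalty_weight=0.0)
shaped_choice = pick_action(candidates, penalty_weight=5.0)
print(unshaped_choice.name)  # profit-only agent
print(shaped_choice.name)    # agent guided by the recognizer
```

With these toy numbers, the profit-only agent selects the spoofing-like action, while the shaped agent selects a less profitable but still profitable alternative, mirroring the qualitative outcome described above. In a real RL setting the penalty would enter the reward signal during training rather than a one-step greedy choice.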