Reinforcement Learning (RL) is increasingly used to learn and adapt application behavior in many domains, including large-scale and safety-critical systems such as autonomous driving. The advent of plug-and-play RL libraries has further broadened its applicability by letting users integrate RL algorithms into their own code. We note, however, that the majority of such code is not developed by RL engineers, which may lead to poor program quality and, in turn, to bugs, suboptimal performance, and maintainability and evolution problems in RL-based projects. In this paper we begin to explore this hypothesis for code that uses RL, analyzing projects found in the wild to assess their quality from a software engineering perspective. Our study covers 24 popular RL-based Python projects, analyzed with standard software engineering metrics. Our results, in line with similar analyses of ML code in general, show that popular and widely reused RL repositories contain many code smells (3.95% of the code base on average), significantly affecting the projects' maintainability. The most common code smells detected are long method and long method chain, highlighting problems in the definition and interaction of agents. The detected code smells suggest problems in the separation of responsibilities and in the appropriateness of current abstractions for defining RL algorithms.
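To make the two most common smells concrete, the following is a minimal sketch of how long methods and long method chains could be flagged in Python source using the standard `ast` module. It is not the tooling used in the study: the thresholds `MAX_METHOD_LINES` and `MAX_CHAIN_LENGTH`, the helper names, and the choice to count chained attribute accesses as a proxy for chained method calls are all assumptions made for illustration.

```python
import ast

# Hypothetical thresholds for illustration only; not taken from the study.
MAX_METHOD_LINES = 50
MAX_CHAIN_LENGTH = 4


def attribute_chain(node):
    """Return the Attribute nodes along a chain such as a.b().c().d."""
    links = []
    while isinstance(node, (ast.Attribute, ast.Call)):
        if isinstance(node, ast.Attribute):
            links.append(node)
            node = node.value
        else:  # ast.Call: continue through the callee expression
            node = node.func
    return links


def find_smells(source):
    """Return (kind, function name or line number, size) tuples."""
    smells = []
    seen = set()  # ids of inner chain links, so each chain is reported once
    # ast.walk visits parents before children, so the outermost link comes first.
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # end_lineno is available from Python 3.8 onward.
            lines = node.end_lineno - node.lineno + 1
            if lines > MAX_METHOD_LINES:
                smells.append(("long method", node.name, lines))
        elif isinstance(node, ast.Attribute) and id(node) not in seen:
            links = attribute_chain(node)
            seen.update(id(link) for link in links[1:])
            if len(links) > MAX_CHAIN_LENGTH:
                smells.append(("long method chain", node.lineno, len(links)))
    return smells
```

Calling `find_smells` on the text of a module returns one entry per function whose span exceeds the line threshold and one entry per chain whose number of links exceeds the chain threshold. A real analysis would likely use an established smell detector with calibrated thresholds rather than this simplified pass.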