Reinforcement Learning (RL) is increasingly used to learn and adapt application behavior in many domains, including large-scale and safety-critical systems such as autonomous driving. The advent of plug-and-play RL libraries has further broadened its applicability by letting users integrate RL algorithms into their own projects. We note, however, that most such code is not developed by RL engineers, which may lead to poor program quality and, in turn, to bugs, suboptimal performance, and maintainability and evolution problems in RL-based projects. In this paper we begin to explore this hypothesis for code that uses RL, analyzing projects found in the wild to assess their quality from a software engineering perspective. Our study covers 24 popular RL-based Python projects, analyzed with standard software engineering metrics. Our results, in line with similar analyses of ML code in general, show that popular and widely reused RL repositories contain many code smells (3.95% of the code base on average), which significantly affect the projects' maintainability. The most common code smells detected are long method and long method chain, highlighting problems in the definition and interaction of agents. The detected code smells point to problems in the separation of responsibilities and raise questions about the appropriateness of current abstractions for defining RL algorithms.
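To make the two most common smells concrete, the following is a minimal sketch of how long methods and long method chains could be flagged in Python source using the standard `ast` module. It is not the tooling used in the study: the thresholds `MAX_METHOD_LINES` and `MAX_CHAIN_CALLS` and the helper names `chain_calls` and `find_smells` are assumptions chosen for illustration only.

```python
import ast

# Hypothetical thresholds for illustration; the study's cut-offs are not stated here.
MAX_METHOD_LINES = 30
MAX_CHAIN_CALLS = 3


def chain_calls(node):
    """Collect the Call nodes in a chain such as a.b().c().d(), outermost first."""
    calls = []
    while True:
        if isinstance(node, ast.Call):
            calls.append(node)
            node = node.func
        elif isinstance(node, ast.Attribute):
            node = node.value
        else:
            return calls


def find_smells(source):
    """Return (smell, function name or line number, size) tuples for a source string."""
    tree = ast.parse(source)
    smells = []
    seen = set()  # ids of calls already counted as part of an outer chain
    # ast.walk yields parents before children, so the outermost call comes first.
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # end_lineno is available from Python 3.8 onwards.
            n_lines = node.end_lineno - node.lineno + 1
            if n_lines > MAX_METHOD_LINES:
                smells.append(("long method", node.name, n_lines))
        elif isinstance(node, ast.Call) and id(node) not in seen:
            calls = chain_calls(node)
            seen.update(id(call) for call in calls)
            if len(calls) > MAX_CHAIN_CALLS:
                smells.append(("long method chain", node.lineno, len(calls)))
    return smells


if __name__ == "__main__":
    # Hypothetical agent code with one chain of four calls.
    sample = "score = agent.build().configure().train().evaluate()\n"
    for smell in find_smells(sample):
        print(smell)
```

Counting source lines and chained calls is only a rough proxy. Dedicated code-smell detectors typically combine several metrics, so this sketch should be read as an illustration of the idea rather than a replacement for such tools.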