Harnessing Pre-trained Generalist Agents for Software Engineering Tasks

Artificial Intelligence (AI) is increasingly adopted to develop techniques that improve the reliability, effectiveness, and overall quality of software systems. Deep reinforcement learning (DRL) has recently been used successfully to automate complex tasks such as game testing and solving the job-shop scheduling problem. However, these specialist DRL agents, trained from scratch on specific tasks, generalize poorly to other tasks and require substantial time to develop and retrain. Recently, DRL researchers have begun to develop generalist agents that learn a policy from various environments and can achieve performance similar to or better than specialist agents on new tasks. In the Natural Language Processing and Computer Vision domains, such generalist agents show promising capabilities for adapting to never-before-seen tasks and achieve high performance after a light fine-tuning phase. This paper investigates the potential of generalist agents for solving software engineering (SE) tasks. Specifically, we conduct an empirical study assessing the performance of two generalist agents on two important SE tasks: detecting bugs in games (for two games) and minimizing the makespan in a scheduling task, i.e., solving the job-shop scheduling problem (for two instances). Our results show that the generalist agents outperform the specialist agents with very little fine-tuning effort, achieving a 20% reduction in makespan relative to specialist agents on task-based scheduling. In game testing, some generalist agent configurations detect 85% more bugs than the specialist agents. Building on our analysis, we provide recommendations to help researchers and practitioners select generalist agents that perform effectively on SE tasks.
