Foundation models (FMs), particularly large language models (LLMs), have shown significant promise in various software engineering (SE) tasks, including code generation, debugging, and requirement refinement. Despite these advances, existing evaluation frameworks are insufficient for assessing model performance in iterative, context-rich workflows characteristic of SE activities. To address this limitation, we introduce SE Arena, an interactive platform designed to evaluate SE-focused chatbots. SE Arena provides a transparent, open-source leaderboard, supports multi-round conversational workflows, and enables end-to-end model comparisons. Moreover, SE Arena incorporates a new feature called RepoChat, which automatically injects repository-related context (e.g., issues, commits, pull requests) into the conversation, further aligning evaluations with real-world development processes. This paper outlines the design and capabilities of SE Arena, emphasizing its potential to advance the evaluation and practical application of FMs in software engineering.