On the Workflows and Smells of Leaderboard Operations (LBOps): An Exploratory Study of Foundation Model Leaderboards

Foundation models (FMs), such as large language models (LLMs), are large-scale machine learning (ML) models that have demonstrated remarkable adaptability across downstream software engineering (SE) tasks, including code completion, code understanding, and software development. As a result, FM leaderboards, especially those hosted on cloud platforms, have become essential tools for SE teams to compare and select the best third-party FMs for their specific products and purposes. However, the lack of standardized guidelines for FM evaluation and comparison threatens the transparency of FM leaderboards and limits stakeholders' ability to select FMs effectively. As a first step towards addressing this challenge, our research focuses on understanding how FM leaderboards operate in real-world scenarios ("leaderboard operations", or LBOps) and on identifying potential pitfalls and areas for improvement ("leaderboard smells"). To this end, we perform a multivocal literature review to collect up to 721 FM leaderboards, then examine their documentation and communicate directly with leaderboard operators to understand their workflow patterns. Using card sorting and negotiated agreement, we identify 5 unique workflow patterns and develop a domain model that outlines the essential components of FM leaderboards and their interactions. We then identify 8 unique types of leaderboard smells in LBOps. By mitigating these smells, SE teams can improve transparency, accountability, and collaboration in current LBOps practices, fostering a more robust and responsible ecosystem for FM comparison and selection.
