On the Workflows and Smells of Leaderboard Operations (LBOps): An Exploratory Study of Foundation Model Leaderboards

Foundation models (FMs), large-scale machine learning (ML) models such as large language models (LLMs), have demonstrated remarkable adaptability across downstream software engineering (SE) tasks, including code completion, code understanding, and software development. As a result, FM leaderboards, especially those hosted on cloud platforms, have become essential tools for SE teams to compare and select the best third-party FMs for their specific products and purposes. However, the lack of standardized guidelines for FM evaluation and comparison threatens the transparency of FM leaderboards and limits stakeholders' ability to select FMs effectively. As a first step towards addressing this challenge, our research focuses on understanding how these FM leaderboards operate in real-world scenarios ("leaderboard operations", or LBOps) and identifying potential leaderboard pitfalls and areas for improvement ("leaderboard smells"). To this end, we perform a multivocal literature review to collect up to 721 FM leaderboards, then examine their documentation and communicate directly with leaderboard operators to understand their workflow patterns. Using card sorting and negotiated agreement, we identify 5 unique workflow patterns and develop a domain model that outlines the essential components of FM leaderboards and their interactions. We then identify 8 unique types of leaderboard smells in LBOps. By mitigating these smells, SE teams can improve transparency, accountability, and collaboration in current LBOps practices, fostering a more robust and responsible ecosystem for FM comparison and selection.
