On the Workflows and Smells of Leaderboard Operations (LBOps): An Exploratory Study of Foundation Model Leaderboards

Foundation models (FMs), large-scale machine learning (ML) models such as large language models (LLMs), have demonstrated remarkable adaptability across downstream software engineering (SE) tasks, including code completion, code understanding, and software development. As a result, FM leaderboards have become essential tools for SE teams to compare and select the best third-party FMs for their specific products and purposes. However, the lack of standardized guidelines for FM evaluation and comparison threatens the transparency of FM leaderboards and limits stakeholders' ability to select FMs effectively. As a first step towards addressing this challenge, our research focuses on understanding how these FM leaderboards operate in real-world scenarios ("leaderboard operations", LBOps) and identifying potential pitfalls and areas for improvement ("leaderboard smells"). To this end, we collect up to 1,045 FM leaderboards from five different sources: GitHub, Hugging Face Spaces, Papers With Code, spreadsheets, and independent platforms. We examine their documentation and communicate directly with leaderboard operators to understand their workflows. Through card sorting and negotiated agreement, we identify five distinct workflow patterns and develop a domain model that captures the key components and their interactions within these workflows. We then identify eight unique types of leaderboard smells in LBOps. By mitigating these smells, SE teams can improve transparency, accountability, and collaboration in current LBOps practices, fostering a more robust and responsible ecosystem for FM comparison and selection.
