Benchmarking involves designing scientific test methods, tools, and frameworks that assess specific performance indicators of a given test subject in a quantitative and comparable way. With the development of artificial intelligence, AI benchmark datasets such as ImageNet and DataPerf have gradually become consensus standards in both academia and industry. In the open-source domain, however, constructing a benchmarking framework remains a significant challenge because of the diversity of data types, the breadth of research questions, and the complexity of collaboration networks. This paper introduces OpenPerf, a benchmarking framework designed for the sustainable development of the open-source ecosystem. The framework defines 9 benchmarking tasks for open-source research. These tasks span 3 data types (time series, text, and graphs) and address 6 research problems: regression, classification, recommendation, ranking, network building, and anomaly detection. Building on these tasks, we implemented 3 data science task benchmarks, 2 index-based benchmarks, and 1 standard benchmark. Notably, the China Electronics Standardization Institute has adopted the index-based benchmarks as evaluation criteria for open-source community governance. We have also developed a comprehensive toolkit for OpenPerf. It provides robust data management, tool integration, and user interface capabilities, and it follows a Benchmarking-as-a-Service (BaaS) model to serve academic institutions, industry, and foundations. OpenPerf has been applied at companies and institutions such as Alibaba, Ant Group, and East China Normal University, and these applications validate its pivotal role in the healthy evolution of the open-source ecosystem.