Machine learning (ML) components are increasingly incorporated into software products, yet developers face challenges in transitioning from ML prototypes to products. Academic researchers struggle to propose solutions to these challenges and to evaluate interventions because they often lack access to closed-source ML products from industry. In this study, we define and identify open-source ML products and curate a dataset of 262 repositories from GitHub to facilitate further research and education. As a start, we explore six broad research questions related to different development activities and report 21 findings from a sample of 30 ML products from the dataset. Our findings reveal a variety of development practices and architectural decisions surrounding different types and uses of ML models, which offer ample opportunities for future research. We also find very little evidence of industry best practices such as model testing and pipeline automation within the open-source ML products, which leaves room for further investigation to understand the potential impact of these practices on the development and eventual end-user experience of the products.