In machine learning (ML), efficient management of assets, including models, datasets, algorithms, and tools, is vital for resource optimization, consistent performance, and a streamlined development lifecycle. It enables quicker iterations, greater adaptability, shorter development-to-deployment time, and reliable outputs. Despite existing research, a significant knowledge gap remains regarding operational challenges such as model versioning, data traceability, and collaboration, which are crucial to the success of ML projects. Our study aims to address this gap by analyzing 15,065 posts from developer forums and platforms. We employ a mixed-method approach to classify inquiries, extract challenges using BERTopic, and identify solutions through open card sorting and BERTopic clustering. We uncover 133 topics related to asset management challenges, grouped into 16 macro-topics, with software dependency, model deployment, and model training being the most discussed. We also find 79 solution topics, categorized under 18 macro-topics, with software dependency, feature development, and file management emerging as key solutions. This research underscores the need for further exploration of the identified pain points and the importance of collaborative efforts across academia, industry, and the research community.