Stakeholders constantly make assumptions during the development of deep learning (DL) frameworks. These assumptions relate to various types of software artifacts (e.g., requirements, design decisions, and technical debt) and can turn out to be invalid, leading to system failures. Existing approaches and tools for assumption management usually depend on manual identification of assumptions. However, assumptions are scattered across the various sources of DL framework development (e.g., code comments, commits, pull requests, and issues), and identifying them manually is costly in both time and resources. To overcome the issues of manual identification, we constructed AssuEval, a new dataset of assumptions collected from the TensorFlow and Keras repositories on GitHub and the largest such dataset to date. We then evaluated how well seven traditional machine learning models (e.g., Support Vector Machine, Classification and Regression Trees), a popular DL model (i.e., ALBERT), and a large language model (i.e., ChatGPT) identify assumptions on the AssuEval dataset. The experimental results show that ALBERT achieves the best performance in identifying assumptions on AssuEval (F1-score: 0.9584), far ahead of the other models (the second-best F1-score is 0.6211, achieved by ChatGPT). Although ChatGPT is the most popular large language model, we do not recommend using it to identify assumptions in DL framework development because of its low performance on this task. Fine-tuning ChatGPT specifically for assumption identification could improve its performance. This study provides researchers with the largest dataset of assumptions for further research (e.g., assumption classification, evaluation, and reasoning) and helps practitioners better understand assumptions and how to manage them in their projects.
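To make the reported metric concrete, the following minimal sketch shows how an F1-score for binary assumption identification can be computed. Everything in it is an illustrative assumption rather than part of the study: the cue-word list, the `keyword_baseline` classifier, and the example sentences and labels are invented, and the baseline is not one of the models evaluated on AssuEval.

```python
from typing import List

# Hypothetical cue words; NOT taken from the AssuEval study.
CUE_WORDS = ("assume", "assuming", "assumption", "expect", "presumably")


def keyword_baseline(sentence: str) -> int:
    """Label a sentence 1 (assumption) if it contains a cue word, else 0."""
    lowered = sentence.lower()
    return int(any(cue in lowered for cue in CUE_WORDS))


def f1_score(y_true: List[int], y_pred: List[int]) -> float:
    """F1 for the positive class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up sentences in the style of code comments and commit messages.
SENTENCES = [
    "We assume the input tensor is already normalized.",
    "Fix typo in the README.",
    "This should only ever be called from the main thread.",
    "I expect the cache to be warm at this point.",
    "Bump version to 2.1.",
]
# Made-up gold labels: 1 = assumption, 0 = non-assumption.
LABELS = [1, 0, 1, 1, 0]

PREDICTIONS = [keyword_baseline(s) for s in SENTENCES]

if __name__ == "__main__":
    print(f"F1 = {f1_score(LABELS, PREDICTIONS):.4f}")  # prints F1 = 0.8000
```

The third sentence states an assumption without using any cue word, so the keyword baseline misses it. This kind of implicit phrasing is one plausible reason a learned classifier could outperform simple pattern matching on such data.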