An exploratory study on automatic identification of assumptions in the development of deep learning frameworks

Stakeholders constantly make assumptions in the development of deep learning (DL) frameworks. These assumptions relate to various types of software artifacts (e.g., requirements, design decisions, and technical debt) and can turn out to be invalid, leading to system failures. Existing approaches and tools for assumption management usually depend on manual identification of assumptions. However, assumptions are scattered across various sources in DL framework development (e.g., code comments, commits, pull requests, and issues), and identifying them manually is costly in both time and resources. To overcome the issues of manually identifying assumptions in DL framework development, we constructed AssuEval, a new dataset of assumptions collected from the TensorFlow and Keras repositories on GitHub and the largest such dataset to date. We then explored the performance of seven traditional machine learning models (e.g., Support Vector Machine, Classification and Regression Trees), a popular DL model (i.e., ALBERT), and a large language model (i.e., ChatGPT) in identifying assumptions on the AssuEval dataset. The experimental results show that ALBERT achieves the best performance in identifying assumptions on the AssuEval dataset (F1-score: 0.9584), much better than the other models (the second-best F1-score is 0.6211, achieved by ChatGPT). Although ChatGPT is the most popular large language model, we do not recommend using it to identify assumptions in DL framework development because of its low performance on this task. Fine-tuning ChatGPT specifically for assumption identification could improve its performance. This study provides researchers with the largest dataset of assumptions for further research (e.g., assumption classification, evaluation, and reasoning) and helps practitioners better understand assumptions and how to manage them in their projects.
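To make the reported metric concrete, the following is a minimal sketch of how an F1-score can be computed for a binary "assumption / not an assumption" labeling. It is an illustration only: the function name `f1_score`, the label encoding (1 = assumption, 0 = not an assumption), and the toy labels are assumptions made here, and the abstract does not state which averaging scheme (e.g., binary, macro, or weighted) the study used.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Binary F1-score for the `positive` class (hypothetical helper, not from the study)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    # True positives: sentences correctly flagged as assumptions.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: non-assumptions wrongly flagged as assumptions.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: assumptions the model missed.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    if tp == 0:
        # Precision and recall are both zero (or undefined); report 0.0 by convention.
        return 0.0

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy example (made-up labels): 1 = assumption, 0 = not an assumption.
gold = [1, 1, 1, 0, 0, 0, 1, 0]
pred = [1, 1, 0, 0, 1, 0, 1, 0]
print(f"F1 = {f1_score(gold, pred):.4f}")
```

Because F1 is the harmonic mean of precision and recall, a model scores highly only when it both finds most assumptions and raises few false alarms, which is why it is a common single-number summary for this kind of identification task.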
