An Exploratory Study on Automatic Identification of Assumptions in the Development of Deep Learning Frameworks

Stakeholders constantly make assumptions during the development of deep learning (DL) frameworks. These assumptions relate to various types of software artifacts (e.g., requirements, design decisions, and technical debt) and can turn out to be invalid, leading to system failures. Existing approaches and tools for assumption management usually depend on manual identification of assumptions. However, assumptions are scattered across various sources in DL framework development (e.g., code comments, commits, pull requests, and issues), and identifying them manually is costly in time and resources. To address this problem, we constructed AssuEval, a new dataset of assumptions collected from the TensorFlow and Keras repositories on GitHub and the largest of its kind, and evaluated how well seven traditional machine learning models (e.g., Support Vector Machine, Classification and Regression Trees), a popular DL model (ALBERT), and a large language model (ChatGPT) identify assumptions on this dataset. The experimental results show that ALBERT achieves the best performance on AssuEval (F1-score: 0.9584), far ahead of the other models (the second-best F1-score, 0.6211, is achieved by ChatGPT). Although ChatGPT is the most popular large language model, we do not recommend using it to identify assumptions in DL framework development because of its low performance on this task. Fine-tuning ChatGPT specifically for assumption identification could improve its performance. This study provides researchers with the largest dataset of assumptions for further research (e.g., assumption classification, evaluation, and reasoning) and helps practitioners better understand assumptions and how to manage them in their projects.

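The F1-scores reported in the abstract are the harmonic mean of precision and recall. The following is a minimal sketch of that metric, assuming a binary labeling (1 = assumption, 0 = non-assumption) and the positive-class F1; the abstract does not state which averaging scheme the study uses, and the function name and toy labels here are hypothetical, not taken from AssuEval.

```python
def f1_score(y_true, y_pred, positive=1):
    """Positive-class F1: harmonic mean of precision and recall.

    Assumption: labels are binary and `positive` marks an "assumption" sentence.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined),
        # so F1 is conventionally reported as 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels, for illustration only.
gold = [1, 1, 1, 0, 0]
predicted = [1, 1, 0, 1, 0]
score = f1_score(gold, predicted)
```

With these toy labels there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and so is the F1-score.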