Large language models (LLMs) such as ChatGPT have shown potential to assist developers with coding and debugging tasks. However, their role in collaborative issue resolution remains underexplored. In this study, we analyzed 1,152 developer-ChatGPT conversations across 1,012 GitHub issues to examine how ChatGPT is used and how much developers rely on its generated code. Our contributions are fourfold. First, we manually analyzed 289 conversations to understand how ChatGPT is used in GitHub Issues. Our analysis revealed that ChatGPT is primarily used for ideation, whereas its use for validation (e.g., checking the accuracy of code documentation) is minimal. Second, we applied BERTopic modeling to the entire dataset to identify key areas of engagement. We found that backend issues (e.g., API management) dominate conversations, while testing is surprisingly underrepresented. Third, we used the CPD clone detection tool to check whether code generated by ChatGPT was used to address issues. We found that ChatGPT-generated code was used as-is to resolve only 5.83% of the issues. Fourth, we estimated sentiment with a RoBERTa-based sentiment analysis model to gauge developers' satisfaction across usages and engagement areas. We found positive sentiment (i.e., high satisfaction) toward using ChatGPT for refactoring and for addressing data analytics issues (e.g., categorizing table data). In contrast, we observed negative sentiment when ChatGPT was used to debug issues and to address automation tasks (e.g., GUI interactions). Our findings point to unmet needs and growing dissatisfaction among developers. Researchers and ChatGPT developers should focus on task-specific solutions that help resolve diverse issues, improving user satisfaction and problem-solving efficiency in software development.