We surveyed 135 software engineering (SE) practitioners to understand how they use generative AI-based chatbots such as ChatGPT for SE tasks. We find that they want to use ChatGPT for SE tasks such as software library selection but often worry about the truthfulness of its responses. We developed a suite of techniques and a tool called CID (ChatGPT Incorrectness Detector) to automatically test and detect incorrectness in ChatGPT responses. CID iteratively prompts ChatGPT with contextually similar but textually divergent questions, using an approach based on metamorphic relationships in texts. The underlying principle of CID is that, for a given question, a response that differs from the other responses (across multiple incarnations of the question) is likely incorrect. In a benchmark study of library selection, we show that CID can detect incorrect responses from ChatGPT with an F1-score of 0.74 to 0.75.
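The principle that a response diverging from its siblings is likely incorrect can be illustrated with a minimal sketch. This is not CID's actual implementation: the abstract does not specify how responses are compared, so token-level Jaccard similarity and a fixed threshold are assumptions made here for illustration. The names `tokens`, `jaccard`, and `flag_divergent`, the threshold value, and the sample responses about a made-up library `LibA` are all hypothetical. In practice the responses would come from asking ChatGPT several rephrased variants of one question.

```python
import re
from statistics import mean


def tokens(text: str) -> set[str]:
    """Return the set of lowercased alphanumeric word tokens in a response."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two token sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def flag_divergent(responses: list[str], threshold: float = 0.3) -> list[bool]:
    """Flag each response whose mean similarity to the others is below threshold.

    Assumption: lexical overlap stands in for whatever comparison the real
    tool uses; the threshold of 0.3 is arbitrary and chosen for this example.
    """
    token_sets = [tokens(r) for r in responses]
    flags = []
    for i, current in enumerate(token_sets):
        others = [jaccard(current, other)
                  for j, other in enumerate(token_sets) if j != i]
        # A single response has nothing to diverge from, so it is not flagged.
        flags.append(bool(others) and mean(others) < threshold)
    return flags


# Hypothetical responses to four rephrasings of the same question.
responses = [
    "Yes, LibA supports streaming JSON parsing.",
    "Yes, LibA does support streaming JSON parsing.",
    "LibA supports streaming parsing of JSON, yes.",
    "No, it cannot parse streams.",
]
print(flag_divergent(responses))  # [False, False, False, True]
```

Only the fourth response is flagged, because it shares no tokens with the three that agree with each other. A purely lexical measure like this one would miss a response that contradicts the others while reusing their wording, which is one reason to treat the sketch as an illustration of the principle and not as a detector.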