Software requirements are derived through a variety of elicitation techniques, many of which are conversational in nature, such as interviews. However, evaluating whether the derived requirements faithfully reflect the stakeholders' needs remains a challenging manual task. In this paper, we formalize the task of aligning an interview transcript with a collection of requirements represented as user stories. We propose two heuristic alignment metrics: (i) requirements faithfulness, the proportion of stories supported by the transcript, and (ii) interview coverage, the proportion of the transcript supported by at least one story. We then run experiments with large language models (LLMs) and embedding models to assess how well these metrics can be evaluated automatically. Experiments over four datasets show that an LLM-based solution achieves 0.86 macro-F1 on manually labeled chunk-story pairs. We also show how embedding models can be used as blockers to make the approach more scalable. This work paves the way for further research on linking conversational artifacts with requirements. The formal framework and the automated matching techniques are basic building blocks for emerging tasks such as tracing requirements to interviews and generating requirements from conversations.
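To make the two metrics concrete, the following is a minimal sketch, not the implementation evaluated in the paper. It assumes that the transcript has already been split into chunks (the abstract refers to chunk-story pairs) and that some matcher, such as an LLM, has produced the set of (chunk, story) pairs judged as supported. The function name `alignment_metrics` and the index-based representation are hypothetical, and measuring coverage as a proportion of chunks is an assumption about how "proportion of the transcript" is operationalized.

```python
from typing import Iterable, Set, Tuple


def alignment_metrics(
    num_chunks: int,
    num_stories: int,
    supported_pairs: Iterable[Tuple[int, int]],
) -> Tuple[float, float]:
    """Return (requirements faithfulness, interview coverage).

    supported_pairs holds (chunk_index, story_index) pairs for which the
    matcher decided that the chunk supports the story (assumed input).
    """
    pairs: Set[Tuple[int, int]] = set(supported_pairs)

    # Stories supported by at least one transcript chunk.
    supported_stories = {story for _, story in pairs}
    # Chunks that support at least one story.
    covered_chunks = {chunk for chunk, _ in pairs}

    # Guard against empty inputs to avoid division by zero.
    faithfulness = len(supported_stories) / num_stories if num_stories else 0.0
    coverage = len(covered_chunks) / num_chunks if num_chunks else 0.0
    return faithfulness, coverage


# Toy example: 4 chunks, 3 stories, 3 supported pairs.
# Stories 0 and 2 are supported (2 of 3); chunks 0, 1, 2 are covered (3 of 4).
faithfulness, coverage = alignment_metrics(4, 3, [(0, 0), (1, 0), (2, 2)])
print(f"faithfulness={faithfulness:.2f}, coverage={coverage:.2f}")
```

In this toy case, story 1 has no supporting chunk and chunk 3 is not reflected in any story, which are exactly the two kinds of gaps the metrics are meant to surface.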