Large language models (LLMs) can automate the generation of software requirements from natural language inputs such as the transcripts of elicitation interviews. However, evaluating whether those derived requirements faithfully reflect the stakeholders' needs remains a largely manual task. We introduce Text2Stories, a task and a set of metrics for text-to-story alignment that quantify the extent to which requirements (in the form of user stories) match the actual needs expressed by the elicitation session participants. Given an interview transcript and a set of user stories, our metrics quantify (i) correctness: the proportion of stories supported by the transcript, and (ii) completeness: the proportion of the transcript covered by at least one story. We segment the transcript into text chunks and instantiate the alignment as a matching problem between chunks and stories. Experiments over four datasets show that an LLM-based matcher achieves 0.86 macro-F1 on held-out annotations, while embedding models alone lag behind but enable effective blocking. Finally, we show how our metrics enable comparison across sets of stories (e.g., human-written vs. generated), positioning Text2Stories as a scalable, source-faithful complement to existing user-story quality criteria.
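To make the two metrics concrete, the sketch below (an illustration under our own assumptions, not the paper's implementation) computes correctness and completeness from a binary chunk-story match matrix; the names `match`, `correctness`, and `completeness` are hypothetical.

```python
# Illustrative sketch (assumed names, not the paper's code):
# match[i][j] is True when transcript chunk i supports user story j.
from typing import Sequence


def correctness(match: Sequence[Sequence[bool]]) -> float:
    """Proportion of stories supported by at least one transcript chunk."""
    n_stories = len(match[0])
    supported = sum(
        any(match[i][j] for i in range(len(match))) for j in range(n_stories)
    )
    return supported / n_stories


def completeness(match: Sequence[Sequence[bool]]) -> float:
    """Proportion of transcript chunks covered by at least one story."""
    covered = sum(any(row) for row in match)
    return covered / len(match)


# Toy example: 3 chunks x 2 stories.
m = [[True, False],
     [False, False],
     [True, True]]
print(correctness(m), completeness(m))  # 1.0, 0.666...
```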