Cyber-physical systems (CPSs) deeply integrate the information space and the physical world, which makes extracting requirements concerns more challenging. Several automated solutions for requirements concern extraction have been proposed to reduce the burden on requirements engineers. However, evaluating the effectiveness of these solutions depends on fair and comprehensive benchmarks and remains an open problem. To address this gap, we propose ReqEBench, a new benchmark for CPS requirements concern extraction that contains 2,721 requirements from 12 real-world CPSs. ReqEBench offers four advantages: (1) it aligns with real-world CPS requirements along multiple dimensions, e.g., scale and complexity; (2) it covers a comprehensive set of concerns related to CPS requirements; (3) it was built through a rigorous annotation process; and (4) it spans multiple CPS application domains, e.g., aerospace and healthcare. Using ReqEBench, we conduct a comparative study of three types of automated requirements concern extraction solutions and report their performance on real-world CPSs. We find that the highest F1 score achieved by GPT-4 in entity concern extraction is only 0.24. We further analyze failure cases of popular LLM-based solutions, summarize their shortcomings, and suggest ideas for improving their capabilities. We believe ReqEBench will facilitate the evaluation and development of automated requirements concern extraction.