Classification or Prompting: A Case Study on Legal Requirements Traceability

New regulations are introduced to ensure that software development aligns with ethical concerns and protects public safety. Demonstrating compliance requires tracing requirements to legal provisions. Requirements traceability is a key task in which engineers must analyze technical requirements against target artifacts, often within limited time. Manually analyzing complex systems with hundreds of requirements is infeasible, and the legal dimension adds challenges that further increase the effort. In this paper, we investigate two automated solutions based on language models, including large language models (LLMs). The first solution, Kashif, is a classifier that leverages sentence transformers and semantic similarity. The second solution, RICE_LRT, prompts a recent LLM using RICE, a prompt engineering framework. Using a publicly available benchmark dataset, we empirically evaluate Kashif and compare it against seven baseline classifiers from the literature (LSI, LDA, GloVe, TraceBERT, RoBERTa, and LLaMa). Kashif identifies trace links with an F2 score of 63%, outperforming the best baseline by 21 percentage points (pp). On a newly created and more complex requirements document traced to the European General Data Protection Regulation (GDPR), RICE_LRT outperforms Kashif and baseline prompts from the literature, achieving an average recall of 84% and an F2 score of 61%, an improvement of 34 pp in F2 score over the best baseline prompt. Our results indicate that requirements traceability in legal contexts cannot be adequately addressed by techniques from the literature that are not specifically designed for legal artifacts. Furthermore, we demonstrate that our engineered prompt outperforms both classifier-based approaches and baseline prompts.
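The results above are reported as F2 scores, which weight recall more heavily than precision. As a minimal sketch of how such a score is computed, the following uses the standard F-beta definition with beta = 2. The function name `f_beta` and the confusion counts are hypothetical and chosen only for illustration; they are not results from this study.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """Standard F-beta score from confusion counts.

    With beta > 1, recall is weighted more heavily than precision;
    beta = 2 gives the F2 score.
    """
    if tp == 0:
        # No true positives: precision and recall are both zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts (illustrative only): 8 correct trace links found,
# 12 spurious links predicted, 2 true links missed.
f2 = f_beta(tp=8, fp=12, fn=2)
f1 = f_beta(tp=8, fp=12, fn=2, beta=1.0)
print(f"F2 = {f2:.3f}, F1 = {f1:.3f}")
```

With these hypothetical counts, precision is 0.4 and recall is 0.8, so F2 comes out higher than F1. A recall-weighted metric is a natural fit when a missed trace link costs more than a spurious link that an engineer can discard on review.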
