TraceLLM: Leveraging Large Language Models with Prompt Engineering for Enhanced Requirements Traceability

Requirements traceability, the process of establishing and maintaining relationships between requirements and other software development artifacts, is essential for ensuring system integrity and for verifying that requirements are fulfilled throughout the Software Development Life Cycle (SDLC). Traditional methods, including manual tracing and information retrieval (IR) models, are labor-intensive, error-prone, and limited by low precision. Recently, Large Language Models (LLMs) have demonstrated potential for supporting software engineering tasks through advanced language comprehension. However, a substantial gap exists in the systematic design and evaluation of prompts tailored to extracting accurate trace links. This paper introduces TraceLLM, a systematic framework for enhancing requirements traceability through prompt engineering and demonstration selection. Our approach incorporates rigorous dataset splitting, iterative prompt refinement, enrichment with contextual roles and domain knowledge, and evaluation in zero-shot and few-shot settings. We assess prompt generalization and robustness using eight state-of-the-art LLMs on four benchmark datasets representing diverse domains (aerospace, healthcare) and artifact types (requirements, design elements, test cases, regulations). TraceLLM achieves state-of-the-art F2 scores, outperforming traditional IR baselines, fine-tuned models, and prior LLM-based methods. We also explore the impact of demonstration selection strategies and identify label-aware, diversity-based sampling as particularly effective. Overall, our findings show that traceability performance depends not only on model capacity but also, critically, on the quality of prompt engineering. In addition, the achieved performance suggests that TraceLLM can support semi-automated traceability workflows in which candidate links are reviewed and validated by human analysts.
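
The F2 score mentioned above is the F-beta measure with beta = 2, which weights recall more heavily than precision. The abstract does not spell out the formula, so the following is a minimal sketch using the standard F-beta definition, applied to sets of (source, target) trace links. The function names and the example artifact IDs are hypothetical and are not taken from TraceLLM.

```python
from typing import Set, Tuple

Link = Tuple[str, str]  # (source artifact ID, target artifact ID); hypothetical IDs


def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """Standard F-beta from confusion counts; beta=2 favors recall."""
    if tp == 0:
        return 0.0  # no correct links: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def score_links(predicted: Set[Link], gold: Set[Link], beta: float = 2.0) -> float:
    """Score predicted trace links against a gold-standard link set."""
    tp = len(predicted & gold)   # links that are predicted and correct
    fp = len(predicted - gold)   # links that are predicted but wrong
    fn = len(gold - predicted)   # correct links that were missed
    return f_beta(tp, fp, fn, beta)


# Illustrative data only: 3 correct links, 2 spurious links, 1 missed link.
gold = {("R1", "D1"), ("R1", "D2"), ("R2", "D3"), ("R3", "D4")}
predicted = {("R1", "D1"), ("R1", "D2"), ("R2", "D3"), ("R2", "D9"), ("R3", "D7")}

f2 = score_links(predicted, gold)            # precision 0.60, recall 0.75
f1 = score_links(predicted, gold, beta=1.0)  # for comparison
print(f"F2 = {f2:.3f}, F1 = {f1:.3f}")
```

Because a missed trace link is usually costlier than a spurious candidate that an analyst can reject, a recall-weighted metric such as F2 fits the semi-automated review workflow described above.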
