SpecMap: Hierarchical LLM Agent for Datasheet-to-Code Traceability Link Recovery in Systems Engineering

Establishing precise traceability between embedded systems datasheets and their corresponding code implementations remains a fundamental challenge in systems engineering, particularly for low-level software, where manually mapping specification documents to large code repositories is infeasible. Existing Traceability Link Recovery approaches rely primarily on lexical similarity and information retrieval techniques, which struggle to capture the semantic, structural, and symbol-level relationships prevalent in embedded systems software. We present a hierarchical datasheet-to-code mapping methodology that employs large language models for semantic analysis while explicitly structuring the traceability process across multiple abstraction levels. Rather than matching specifications to code directly, the proposed approach progressively narrows the search space through repository-level structure inference, file-level relevance estimation, and fine-grained symbol-level alignment. The method extends beyond function-centric mapping by explicitly covering macros, structs, constants, configuration parameters, and register definitions commonly found in systems-level C/C++ codebases. We evaluate the approach on multiple open-source embedded systems repositories using manually curated datasheet-to-code ground truth. Experimental results show substantial improvements over traditional information-retrieval-based baselines, with file mapping accuracy of up to 73.3%. The approach also significantly reduces computational overhead, lowering total LLM token consumption by 84% and end-to-end runtime by approximately 80%. This methodology supports automated analysis of large embedded software systems and enables downstream applications such as training data generation for systems-aware machine learning models, standards compliance verification, and large-scale specification coverage analysis.
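
The coarse-to-fine narrowing described above (repository level, then file level, then symbol level) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `trace`, `extract_symbols`, `top_k`, and `token_overlap` are hypothetical, the relevance scorer is a simple token-overlap placeholder standing in for the LLM-based semantic analysis, and symbol extraction uses rough regular-expression heuristics rather than a real C/C++ parser.

```python
import re
from typing import Callable, Dict, List, Tuple

Scorer = Callable[[str, str], float]


def token_overlap(spec: str, text: str) -> float:
    """Placeholder scorer (assumption): fraction of spec tokens present in text.
    A real system would replace this with an LLM relevance judgment."""
    spec_tokens = set(re.findall(r"[a-z0-9]+", spec.lower()))
    text_tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return len(spec_tokens & text_tokens) / len(spec_tokens) if spec_tokens else 0.0


# Heuristic patterns (assumption): one symbol per line, checked in this order.
PATTERNS = [
    ("macro", re.compile(r"^\s*#\s*define\s+(\w+)")),
    ("struct", re.compile(r"^\s*(?:typedef\s+)?struct\s+(\w+)\s*\{?\s*$")),
    ("function", re.compile(r"^\s*\w[\w\s\*]*?\b(\w+)\s*\([^;]*$")),
]


def extract_symbols(source: str) -> List[Tuple[str, str]]:
    """Return (kind, name) pairs for macros, structs, and function definitions."""
    symbols = []
    for line in source.splitlines():
        for kind, pattern in PATTERNS:
            match = pattern.match(line)
            if match:
                symbols.append((kind, match.group(1)))
                break
    return symbols


def top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Keep the k best-scoring keys; ties are broken by name for determinism."""
    return sorted(scores, key=lambda name: (-scores[name], name))[:k]


def trace(spec: str, repo: Dict[str, str], scorer: Scorer = token_overlap,
          k_dirs: int = 1, k_files: int = 1,
          k_symbols: int = 2) -> List[Tuple[str, str, str]]:
    """Map one specification passage to (file, symbol kind, symbol name) links."""
    # Stage 1, repository level: score directories using path names only.
    dirs: Dict[str, List[str]] = {}
    for path in repo:
        directory = path.rsplit("/", 1)[0] if "/" in path else ""
        dirs.setdefault(directory, []).append(path)
    dir_scores = {d: scorer(spec, d + " " + " ".join(paths))
                  for d, paths in dirs.items()}
    kept_dirs = top_k(dir_scores, k_dirs)

    # Stage 2, file level: read file contents only inside the kept directories.
    file_scores = {p: scorer(spec, repo[p]) for d in kept_dirs for p in dirs[d]}
    kept_files = top_k(file_scores, k_files)

    # Stage 3, symbol level: align the passage with individual symbols.
    links = []
    for path in kept_files:
        kinds = {name: kind for kind, name in extract_symbols(repo[path])}
        symbol_scores = {name: scorer(spec, name) for name in kinds}
        for name in top_k(symbol_scores, k_symbols):
            links.append((path, kinds[name], name))
    return links
```

Because each stage only examines the candidates that survived the previous one, file contents and symbols outside the selected directories are never scored. This is the property the abstract credits for the reduced token consumption and runtime, although the cut-off parameters `k_dirs`, `k_files`, and `k_symbols` used here are illustrative assumptions.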
