AutoDFBench 1.0: A Benchmarking Framework for Digital Forensic Tool Testing and Generated Code Evaluation

The National Institute of Standards and Technology (NIST) Computer Forensic Tool Testing (CFTT) programme has become the de facto standard for digital forensic (DF) tool testing and validation. However, to date, no comprehensive framework exists to automate benchmarking across the diverse forensic tasks included in the programme. This gap results in inconsistent validation, difficulty in comparing tools, and limited reproducibility of validation results. This paper introduces AutoDFBench 1.0, a modular benchmarking framework that supports the evaluation of conventional DF tools and scripts as well as AI-generated code and agentic approaches. The framework integrates five areas defined by the CFTT programme: string search, deleted file recovery, file carving, Windows registry recovery, and SQLite data recovery. AutoDFBench 1.0 includes ground truth data comprising 63 test cases and 10,968 unique test scenarios, and executes evaluations through a RESTful API that produces structured JSON outputs with standardised metrics, including precision, recall, and F1 score for each test case. The average of these per-test-case F1 scores is the AutoDFBench Score. The benchmarking framework is validated against CFTT's datasets. It enables fair and reproducible comparison across tools and forensic scripts, establishing the first unified, automated, and extensible benchmarking framework for digital forensic tool testing and validation. AutoDFBench 1.0 supports tool vendors, researchers, practitioners, and standardisation bodies by facilitating transparent, reproducible, and comparable assessments of DF technologies.
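The scoring described in the abstract (precision, recall, and F1 per test case, then the mean F1 as the overall score) can be illustrated with a minimal sketch. The names `CaseCounts`, `precision_recall_f1`, and `autodfbench_score` are hypothetical and are not taken from the framework. The counts are made up for illustration, and treating an empty denominator as 0.0 is an assumption rather than documented behaviour.

```python
from dataclasses import dataclass


@dataclass
class CaseCounts:
    """Hypothetical per-test-case tallies; not the framework's actual schema."""
    true_positives: int
    false_positives: int
    false_negatives: int


def precision_recall_f1(c: CaseCounts) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for one test case."""
    found = c.true_positives + c.false_positives      # items the tool reported
    expected = c.true_positives + c.false_negatives   # items in the ground truth
    # Assumption: a metric with an empty denominator is reported as 0.0.
    precision = c.true_positives / found if found else 0.0
    recall = c.true_positives / expected if expected else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def autodfbench_score(cases: list[CaseCounts]) -> float:
    """Unweighted mean of per-test-case F1 scores."""
    if not cases:
        return 0.0
    return sum(precision_recall_f1(c)[2] for c in cases) / len(cases)


# Illustrative counts only; these are not results from the paper.
cases = [CaseCounts(8, 2, 2), CaseCounts(5, 0, 5)]
score = autodfbench_score(cases)
print(f"AutoDFBench Score (sketch): {score:.4f}")
```

Because the mean is taken over test cases rather than over individual scenarios, a test case with few scenarios weighs as much as one with many under this reading of the abstract.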
