NeuroTrace: Inference Provenance-Based Detection of Adversarial Examples

Deep neural networks (DNNs) remain largely opaque at inference time, limiting our ability to detect and diagnose malicious input manipulations such as adversarial examples. Existing detection methods predominantly rely on layer-local signals (e.g., activations or attribution scores), leaving cross-layer information flow and execution structure under-explored. We introduce NeuroTrace, a framework and open dataset for analyzing inference provenance through Inference Provenance Graphs (IPGs). IPGs are heterogeneous graphs that capture both activation behavior and parameter-induced dataflow during a model's forward pass, providing a structured representation of how information propagates through the network. NeuroTrace includes (i) a reproducible extraction engine that instruments model execution, (ii) a standardized graph representation compatible with heterogeneous GNNs, and (iii) a benchmark suite spanning multiple adversarial attack families across vision and malware domains. Using this framework, we evaluate IPG-based detectors for adversarial example detection under intra-attack, multi-attack, and cross-threat transfer settings. Our results show that inference provenance provides a strong and transferable signal for distinguishing adversarial and benign inputs, achieving consistently high detection performance and improving over prior graph-based baselines. We further analyze the conditions under which provenance-based detection generalizes across attack types, as well as the associated runtime and storage trade-offs. By releasing the dataset, extraction pipeline, and evaluation protocol, NeuroTrace enables systematic study of inference-time behavior and establishes inference provenance as a practical foundation for building more transparent and auditable machine learning systems.
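To make the idea of an IPG concrete, the following is a minimal sketch that records a provenance graph for one forward pass of a tiny dense ReLU network. It is not the NeuroTrace extraction engine: the abstract does not specify the actual graph schema, so the `IPG` class, the `extract_ipg` function, the node types (`input`, `hidden`, `output`), the single `dataflow` edge type, and the contribution threshold are all illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class IPG:
    """Toy inference provenance graph (hypothetical schema, not the paper's)."""
    # (layer, index) -> {"type": str, "activation": float}
    nodes: dict = field(default_factory=dict)
    # (src_node, dst_node, {"type": str, "weight": float, "contribution": float})
    edges: list = field(default_factory=list)


def relu(x):
    return x if x > 0.0 else 0.0


def extract_ipg(x, layers, threshold=0.0):
    """Run a forward pass and record activations and parameter-induced dataflow.

    layers: list of (weights, biases); weights[j][i] connects input i to unit j.
    An edge is kept only if |weight * source activation| > threshold (assumed rule).
    The last layer is left linear; earlier layers use ReLU.
    """
    graph = IPG()
    acts = list(x)
    for i, a in enumerate(acts):
        graph.nodes[(0, i)] = {"type": "input", "activation": a}

    for depth, (weights, biases) in enumerate(layers, start=1):
        is_last = depth == len(layers)
        new_acts = []
        for j, (row, b) in enumerate(zip(weights, biases)):
            pre = b
            for i, w in enumerate(row):
                contribution = w * acts[i]
                pre += contribution
                if abs(contribution) > threshold:
                    graph.edges.append((
                        (depth - 1, i),
                        (depth, j),
                        {"type": "dataflow", "weight": w,
                         "contribution": contribution},
                    ))
            out = pre if is_last else relu(pre)
            graph.nodes[(depth, j)] = {
                "type": "output" if is_last else "hidden",
                "activation": out,
            }
            new_acts.append(out)
        acts = new_acts
    return graph, acts


# Example: 2 inputs -> 2 hidden ReLU units -> 1 linear output.
example_layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 1.0]], [0.0]),
]
example_graph, example_output = extract_ipg([1.0, 2.0], example_layers)
print(len(example_graph.nodes), len(example_graph.edges), example_output)
```

In this example the first hidden unit is switched off by the ReLU, so its outgoing edge carries zero contribution and is dropped. A graph in this form (typed nodes with activation features, typed edges with weight features) is the kind of input a heterogeneous GNN detector could consume, although the features NeuroTrace actually uses may differ.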
