Prior work on probing neural networks relies primarily on input-space analysis or parameter perturbation, both of which face fundamental limitations in accessing the structural information encoded in intermediate representations. We introduce Activation Perturbation for EXploration (APEX), an inference-time probing paradigm that perturbs hidden activations while keeping both the inputs and the model parameters fixed. We show theoretically that activation perturbation induces a principled transition from sample-dependent to model-dependent behavior by suppressing input-specific signals and amplifying representation-level structure, and we further establish that input perturbation corresponds to a constrained special case of this framework. Through representative case studies, we demonstrate the practical advantages of APEX. In the small-noise regime, APEX provides a lightweight and efficient measure of sample regularity that aligns with established metrics, while also distinguishing structured models from randomly labeled ones and revealing semantically coherent prediction transitions. In the large-noise regime, APEX exposes training-induced model-level biases, including a pronounced concentration of predictions on the target class in backdoored models. Overall, our results show that APEX offers an effective lens for exploring and understanding neural networks beyond what is accessible from input space alone.
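To make the fixed-input, fixed-parameter setting concrete, the sketch below shows one plausible way to inject noise into a hidden activation at inference time using a PyTorch forward hook. The choice of layer, the isotropic Gaussian noise, and the scale parameter `sigma` are illustrative assumptions, not the paper's exact protocol.

```python
# A minimal sketch of inference-time activation perturbation in the spirit of
# APEX, assuming a PyTorch model. Noise form and layer choice are assumptions.
import torch
import torch.nn as nn


def perturbed_predict(model: nn.Module, x: torch.Tensor,
                      layer: nn.Module, sigma: float) -> torch.Tensor:
    """Run inference with Gaussian noise added to one hidden activation.

    Inputs and model parameters are left untouched; only the chosen layer's
    output is perturbed, mirroring the fixed-input, fixed-parameter setting.
    """
    def add_noise(module, inputs, output):
        # Returning a value from a forward hook replaces the layer's output.
        return output + sigma * torch.randn_like(output)

    handle = layer.register_forward_hook(add_noise)
    try:
        with torch.no_grad():
            logits = model(x)
    finally:
        handle.remove()  # restore the unperturbed model

    return logits


# Hypothetical usage: a small sigma would probe sample regularity, while a
# large sigma would surface model-level biases such as prediction
# concentration on a backdoor target class.
# model = torchvision.models.resnet18(weights="IMAGENET1K_V1").eval()
# logits = perturbed_predict(model, x, model.layer3, sigma=0.1)
```

Repeating the perturbed forward pass over many noise draws and inspecting the distribution of predictions would then yield the small-noise and large-noise readouts described above.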