Adversarial examples, inputs crafted to trick Artificial Neural Networks (ANNs) into producing wrong outputs, expose vulnerabilities in these models. Probing these weaknesses is crucial for developing defenses, so we propose a method to assess the adversarial robustness of image-classifying ANNs. The t-distributed Stochastic Neighbor Embedding (t-SNE) technique is used for visual inspection, and a metric that compares the clean and perturbed embeddings helps pinpoint weak spots in the layers. Analyzing two ANNs on CIFAR-10, one designed by humans and one produced via NeuroEvolution, we found that differences between clean and perturbed representations emerge early, in the feature-extraction layers, and propagate to the subsequent classification layers. The findings obtained with our metric are corroborated by visual analysis of the t-SNE maps.
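To make the comparison concrete, the sketch below shows one plausible way to embed clean and perturbed layer activations in a shared t-SNE map and quantify their divergence. This is a minimal illustration, not the paper's exact metric: the function name `embedding_shift`, the input arrays `clean_acts`/`pert_acts`, and the mean-displacement proxy are all assumptions for demonstration purposes.

```python
# A minimal sketch (not the authors' exact metric) of comparing clean vs.
# perturbed layer activations with t-SNE. Assumes `clean_acts` and
# `pert_acts` are (n_samples, n_features) activation matrices taken from
# the same layer for the same images; these names are hypothetical.
import numpy as np
from sklearn.manifold import TSNE

def embedding_shift(clean_acts: np.ndarray, pert_acts: np.ndarray) -> float:
    """Embed clean and perturbed activations in one shared t-SNE map and
    return the mean displacement between each clean point and its
    perturbed counterpart (an illustrative proxy metric)."""
    # Embed jointly so both sets of points live in the same 2-D map.
    joint = np.vstack([clean_acts, pert_acts])
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(joint)
    n = clean_acts.shape[0]
    clean_emb, pert_emb = emb[:n], emb[n:]
    # Per-sample Euclidean displacement in the map, averaged over samples.
    return float(np.linalg.norm(clean_emb - pert_emb, axis=1).mean())

# Usage with synthetic activations: compute the shift for one layer; in the
# paper's setting, larger values would flag layers where the perturbed
# representation diverges from the clean one.
rng = np.random.default_rng(0)
clean = rng.normal(size=(200, 64))
pert = clean + rng.normal(scale=0.5, size=clean.shape)
print(f"mean t-SNE displacement: {embedding_shift(clean, pert):.2f}")
```

Running such a probe layer by layer is what would reveal the pattern reported above: if the displacement is already large in the feature-extraction layers, the damage from the perturbation precedes classification.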