Vision models can achieve strong performance on classification tasks, but the internal representations supporting their predictions are often difficult to interpret. This work investigates whether sparse autoencoders can decompose intermediate representations of a vision model into interpretable features. We train a ConvNeXt classifier on the FGVC-Aircraft dataset, extract spatial activations from its final feature stage, and train a sparse autoencoder on these activations. The learned sparse features are analyzed using top-activating image patches, activation strength, and class selectivity. Qualitative visual inspection reveals that several features correspond to recognizable aircraft structures and visual patterns. We evaluate a subset of selected features using input-space and feature-space ablations, measuring how blurring image patches and suppressing sparse features affect class logits, classification margins, and prediction confidence. The results suggest that sparse autoencoders can reveal partially interpretable, class-relevant visual features associated with aircraft recognition, while also exposing limitations such as polysemanticity and coarse spatial localization.
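The feature-space ablation described above can be illustrated with a minimal sketch. It is not the implementation used in this work: all weights and activations are toy values, the names `sae_encode`, `ablate_feature`, and `margin_drop` are hypothetical, the classifier head is assumed to be mean pooling followed by a bias-free linear layer, and ablation is assumed to subtract a feature's decoder contribution from the original activation so that the autoencoder's reconstruction error is left untouched.

```python
# Toy sketch of a feature-space ablation (pure Python, no dependencies).
# Assumptions: ReLU sparse autoencoder, mean-pooled linear classifier head,
# hand-picked weights. None of this is taken from the authors' code.

def relu(v):
    return [max(0.0, a) for a in v]

def matvec(W, v):
    # W is a list of rows; returns W @ v.
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def sae_encode(x, W_enc, b_enc):
    # Sparse feature activations f = ReLU(W_enc x + b_enc).
    return relu([z + b for z, b in zip(matvec(W_enc, x), b_enc)])

def ablate_feature(x, k, W_enc, b_enc, W_dec):
    # Remove feature k by subtracting f_k times its decoder column from x.
    f_k = sae_encode(x, W_enc, b_enc)[k]
    return [a - f_k * row[k] for a, row in zip(x, W_dec)]

def mean_pool(positions):
    # Average the activation vectors over all spatial positions.
    n = len(positions)
    return [sum(col) / n for col in zip(*positions)]

def margin(logits, target):
    # Classification margin: target logit minus the best competing logit.
    others = [l for i, l in enumerate(logits) if i != target]
    return logits[target] - max(others)

def margin_drop(positions, k, target, W_enc, b_enc, W_dec, W_cls):
    # Positive values mean that ablating feature k hurts the target class.
    before = margin(matvec(W_cls, mean_pool(positions)), target)
    ablated = [ablate_feature(x, k, W_enc, b_enc, W_dec) for x in positions]
    after = margin(matvec(W_cls, mean_pool(ablated)), target)
    return before - after

# Toy setup: 2-dimensional activations at 2 spatial positions,
# 2 sparse features, 2 classes, identity weights throughout.
W_enc = [[1.0, 0.0], [0.0, 1.0]]
b_enc = [0.0, 0.0]
W_dec = [[1.0, 0.0], [0.0, 1.0]]  # rows: activation dims, columns: features
W_cls = [[1.0, 0.0], [0.0, 1.0]]
positions = [[2.0, 1.0], [4.0, 1.0]]

drop_f0 = margin_drop(positions, 0, 0, W_enc, b_enc, W_dec, W_cls)
drop_f1 = margin_drop(positions, 1, 0, W_enc, b_enc, W_dec, W_cls)
print(drop_f0, drop_f1)  # 3.0 -1.0
```

In this toy case, ablating feature 0 lowers the margin for class 0 by 3.0, while ablating feature 1 raises it by 1.0. This is the kind of signed, per-feature effect on classification margins that the ablation analysis measures.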