We present an approach for analyzing grouping information contained within a neural network's activations, permitting extraction of spatial layout and semantic segmentation from the behavior of large pre-trained vision models. Unlike prior work, our method conducts a holistic analysis of a network's activation state, leveraging features from all layers and obviating the need to guess which part of the model contains relevant information. Motivated by classic spectral clustering, we formulate this analysis in terms of an optimization objective involving a set of affinity matrices, each formed by comparing features within a different layer. Solving this optimization problem using gradient descent allows our technique to scale from single images to dataset-level analysis, including, in the latter, both intra- and inter-image relationships. Analyzing a pre-trained generative transformer provides insight into the computational strategy learned by such models. Equating affinity with key-query similarity across attention layers yields eigenvectors encoding scene spatial layout, whereas defining affinity by value vector similarity yields eigenvectors encoding object identity. This result suggests that key and query vectors coordinate attentional information flow according to spatial proximity (a `where' pathway), while value vectors refine a semantic category representation (a `what' pathway).
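As a rough illustration of the formulation described above, the following minimal sketch builds one affinity matrix per layer and recovers a single grouping vector by gradient descent on a joint spectral objective. Everything in it is an assumption made for illustration rather than the authors' formulation: the objective (the sum over layers of x^T L_l x with unnormalized Laplacians L_l = D_l - A_l), the clipped cosine affinity, the toy features, and the names `cosine_affinity`, `laplacian`, `project`, and `grouping_vector` are all hypothetical. In a transformer, the per-layer features could be key/query or value vectors taken from each attention layer.

```python
import math
import random

def cosine_affinity(features):
    """Pairwise affinity: cosine similarity clipped at zero, with a zero diagonal."""
    unit = []
    for f in features:
        norm = math.sqrt(sum(v * v for v in f))
        unit.append([v / norm for v in f])
    n = len(unit)
    return [[max(0.0, sum(a * b for a, b in zip(unit[i], unit[j]))) if i != j else 0.0
             for j in range(n)] for i in range(n)]

def laplacian(affinity):
    """Unnormalized graph Laplacian L = D - A."""
    n = len(affinity)
    return [[(sum(affinity[i]) if i == j else 0.0) - affinity[i][j]
             for j in range(n)] for i in range(n)]

def project(x):
    """Remove the constant component, then rescale to unit norm."""
    mean = sum(x) / len(x)
    centered = [v - mean for v in x]
    norm = math.sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered]

def grouping_vector(layers, steps=500, lr=0.02, seed=0):
    """Minimize sum_l x^T L_l x by projected gradient descent (assumed objective)."""
    laps = [laplacian(cosine_affinity(f)) for f in layers]
    n = len(laps[0])
    rng = random.Random(seed)
    x = project([rng.uniform(-1.0, 1.0) for _ in range(n)])
    for _ in range(steps):
        # Gradient of the summed quadratic form: 2 * (sum_l L_l) x.
        grad = [2.0 * sum(L[i][j] * x[j] for L in laps for j in range(n))
                for i in range(n)]
        x = project([xi - lr * gi for xi, gi in zip(x, grad)])
    return x

# Toy data: six "tokens" in two groups, with features from two hypothetical layers.
layer_a = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0],
           [0.1, 1.0], [0.0, 0.9], [0.2, 1.0]]
layer_b = [[0.8, 0.0, 0.2], [1.0, 0.1, 0.0], [0.9, 0.2, 0.1],
           [0.0, 1.0, 0.1], [0.1, 0.9, 0.0], [0.2, 0.8, 0.1]]

x = grouping_vector([layer_a, layer_b])
print([round(v, 3) for v in x])
```

On this toy input the sign pattern of `x` should separate the first three tokens from the last three, which is the sense in which such a vector encodes grouping. Using gradient steps rather than an explicit eigendecomposition reflects the scaling argument in the text, although a practical version would presumably operate on batched tensors and many more tokens.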