A Data-driven Typology of Vision Models from Integrated Representational Metrics

Large vision models differ widely in architecture and training paradigm, yet we lack principled methods to determine which aspects of their representations are shared across families and which reflect distinctive computational strategies. We leverage a suite of representational similarity metrics, each capturing a different facet of a representation (geometry, unit tuning, or linear decodability), and assess family separability using multiple complementary measures. Metrics that preserve geometry or tuning (e.g., RSA, Soft Matching) yield strong family discrimination, whereas flexible mappings such as Linear Predictivity show weaker separation. These findings indicate that geometry and tuning carry family-specific signatures, while linearly decodable information is more broadly shared. To integrate these complementary facets, we adapt Similarity Network Fusion (SNF), a method inspired by multi-omics integration. SNF achieves substantially sharper family separation than any individual metric and produces robust composite signatures. Clustering of the fused similarity matrix recovers both expected and surprising patterns: supervised ResNets and ViTs form distinct clusters, yet all self-supervised models group together across architectural boundaries. Hybrid architectures (ConvNeXt, Swin) cluster with masked autoencoders, suggesting convergence between architectural modernization and reconstruction-based training. This biology-inspired framework provides a principled typology of vision models, showing that emergent computational strategies, shaped jointly by architecture and training objective, define representational structure beyond surface design categories.
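
To make the fusion step concrete, the following is a minimal NumPy sketch of a generic Similarity Network Fusion update applied to model-by-model similarity matrices. It is an illustration under stated assumptions, not the paper's implementation: the function names (`full_kernel`, `local_kernel`, `snf`, `block_similarity`), the neighborhood size `k`, the iteration count, and the toy similarity values are all hypothetical. It further assumes each input matrix is symmetric with non-negative entries, and it renormalizes after every iteration in a simplified way.

```python
import numpy as np


def full_kernel(W):
    """Scale each row so off-diagonal entries sum to 1/2; set the diagonal to 1/2."""
    W = np.array(W, dtype=float)
    np.fill_diagonal(W, 0.0)
    P = W / (2.0 * W.sum(axis=1, keepdims=True))
    np.fill_diagonal(P, 0.5)
    return P


def local_kernel(W, k):
    """Keep each row's k largest off-diagonal similarities, then row-normalize."""
    W = np.array(W, dtype=float)
    np.fill_diagonal(W, 0.0)
    nn = np.argsort(-W, axis=1)[:, :k]          # indices of the k nearest neighbors
    rows = np.arange(W.shape[0])[:, None]
    S = np.zeros_like(W)
    S[rows, nn] = W[rows, nn]
    return S / S.sum(axis=1, keepdims=True)


def snf(similarities, k=2, n_iter=10):
    """Fuse several (n_models x n_models) similarity matrices into one.

    Assumes symmetric, non-negative inputs (one matrix per metric).
    """
    m = len(similarities)
    if m < 2:
        raise ValueError("SNF needs at least two similarity matrices")
    P = [full_kernel(W) for W in similarities]
    S = [local_kernel(W, k) for W in similarities]
    for _ in range(n_iter):
        P_next = []
        for v in range(m):
            # Diffuse the average of the *other* views through this view's local graph.
            others = sum(P[u] for u in range(m) if u != v) / (m - 1)
            Pv = S[v] @ others @ S[v].T
            Pv = (Pv + Pv.T) / 2.0               # keep the update symmetric
            P_next.append(full_kernel(Pv))       # simplified renormalization
        P = P_next
    fused = sum(P) / m
    return (fused + fused.T) / 2.0


def block_similarity(within, between, sizes=(3, 3)):
    """Toy similarity matrix with hypothetical model families as blocks."""
    labels = np.repeat(np.arange(len(sizes)), sizes)
    W = np.where(labels[:, None] == labels[None, :], within, between).astype(float)
    np.fill_diagonal(W, 1.0)
    return W, labels


# Toy usage: two metrics, one separating the families strongly and one weakly.
W_strong, labels = block_similarity(0.9, 0.2)
W_weak, _ = block_similarity(0.7, 0.4)
fused = snf([W_strong, W_weak], k=2, n_iter=10)
```

The fused matrix can then be passed to any clustering routine that accepts a precomputed affinity. Diffusing only through each view's nearest-neighbor graph is the design choice that lets strong local structure in one metric reinforce weaker structure in another.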
