What is an image, and how should latent features be extracted from it? Convolutional Networks (ConvNets) consider an image as pixels organized in a rectangular grid and extract features via convolution over local regions; Vision Transformers (ViTs) treat an image as a sequence of patches and extract features via an attention mechanism over a global range. In this work, we introduce a straightforward and promising paradigm for visual representation, called Context Clusters. Context Clusters (CoCs) view an image as a set of unorganized points and extract features via a simplified clustering algorithm. In detail, each point includes the raw feature (e.g., color) and positional information (e.g., coordinates), and a simplified clustering algorithm is employed to group points and extract deep features hierarchically. Our CoCs are convolution- and attention-free, and rely only on the clustering algorithm for spatial interaction. Owing to this simple design, we show that CoCs offer gratifying interpretability through visualization of the clustering process. Our CoCs aim to provide a new perspective on images and visual representation, which may find broad application in different domains and yield profound insights. Even though we are not targeting SOTA performance, CoCs still achieve comparable or even better results than ConvNets or ViTs on several benchmarks. Code is available at: https://github.com/ma-xu/Context-Cluster.
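The point-set view and a single clustering step described above can be illustrated with a minimal NumPy sketch. This is not the released implementation: the function names `image_to_points` and `cluster_step`, the random choice of centers, the use of cosine similarity, and the plain mean aggregation are all assumptions made for illustration only.

```python
import numpy as np


def image_to_points(image):
    """Flatten an (H, W, C) image into H*W points of size C + 2.

    Each point holds its raw feature (color) followed by normalized
    (y, x) coordinates in [-0.5, 0.5].
    """
    h, w, c = image.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys / max(h - 1, 1), xs / max(w - 1, 1)], axis=-1) - 0.5
    return np.concatenate(
        [image.reshape(h * w, c), coords.reshape(h * w, 2)], axis=1
    )


def cluster_step(points, num_centers, seed=0):
    """One simplified clustering step (hypothetical, for illustration).

    Assumptions: centers are sampled at random from the points,
    similarity is cosine similarity, and each cluster is summarized
    by the plain mean of its members.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(points), size=num_centers, replace=False)
    centers = points[idx]

    # Cosine similarity between every point and every center: (N, K).
    pn = points / (np.linalg.norm(points, axis=1, keepdims=True) + 1e-8)
    cn = centers / (np.linalg.norm(centers, axis=1, keepdims=True) + 1e-8)
    assign = (pn @ cn.T).argmax(axis=1)

    # Aggregate: mean of the points assigned to each cluster.
    agg = np.zeros_like(centers, dtype=float)
    np.add.at(agg, assign, points)
    counts = np.bincount(assign, minlength=num_centers)
    nonempty = counts > 0
    agg[nonempty] /= counts[nonempty, None]

    # Dispatch: each point receives its cluster's aggregated feature.
    updated = points + agg[assign]
    return updated, assign


if __name__ == "__main__":
    image = np.random.default_rng(0).random((8, 8, 3))
    pts = image_to_points(image)
    updated, assign = cluster_step(pts, num_centers=4)
    print(pts.shape, updated.shape, np.bincount(assign, minlength=4))
```

Stacking several such steps while reducing the number of points between them would give the hierarchical grouping the abstract describes; how the points are reduced is left out of this sketch.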