Hybrid Gromov-Wasserstein Embedding for Capsule Learning

Capsule networks (CapsNets) aim to parse images into a hierarchy of objects, parts, and their relations through a two-step process: part-whole transformation followed by hierarchical component routing. This hierarchical relationship modeling is computationally expensive, which has limited the wider adoption of CapsNets despite their potential advantages. Existing CapsNet models are evaluated mainly against other capsule baselines and still fall short of deep CNN variants on complex tasks. To address this limitation, we present an efficient approach to capsule learning that surpasses canonical capsule baselines and even outperforms high-performing convolutional models. Our contribution is twofold. First, we introduce a group of subcapsules onto which an input vector is projected. Second, we present the Hybrid Gromov-Wasserstein framework, which first quantifies the dissimilarity between the input and the components modeled by the subcapsules and then determines their degree of alignment through optimal transport. This mechanism defines alignment between the input and the subcapsules by the similarity of their respective component distributions. The approach enhances the capacity of CapsNets to learn from complex, high-dimensional data while retaining their interpretability and hierarchical structure. The proposed model offers two advantages: (i) it is lightweight, which makes capsules applicable to more complex vision tasks, including object detection; (ii) it outperforms baseline approaches on these demanding tasks.

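To make the alignment step described in the abstract more concrete, the following is a minimal sketch of a standard entropic Gromov-Wasserstein computation between a set of input components and a set of subcapsules. It is not the Hybrid Gromov-Wasserstein framework itself, whose details the abstract does not give. The names `input_parts`, `subcapsules`, `pairwise_sq_dists`, `sinkhorn`, and `entropic_gw`, the uniform weights, the square loss, and the regularization strength `eps` are all assumptions made for illustration.

```python
import numpy as np

def pairwise_sq_dists(x):
    """Squared Euclidean distances between the rows of x."""
    sq = np.sum(x * x, axis=1)
    d = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    return np.maximum(d, 0.0)  # clip tiny negatives caused by round-off

def sinkhorn(cost, p, q, eps, n_iters=200):
    """Entropy-regularized transport plan between weight vectors p and q."""
    # Shifting the cost by a constant does not change the resulting plan.
    k = np.exp(-(cost - cost.min()) / eps)
    u = np.ones_like(p)
    for _ in range(n_iters):
        v = q / (k.T @ u)
        u = p / (k @ v)
    return u[:, None] * k * v[None, :]

def entropic_gw(c1, c2, p, q, eps=0.1, n_outer=30):
    """Entropic Gromov-Wasserstein with square loss (illustrative only)."""
    # Part of the square-loss cost that does not depend on the plan.
    const = (c1 ** 2 @ p)[:, None] + (c2 ** 2 @ q)[None, :]
    plan = np.outer(p, q)  # start from the independent coupling
    for _ in range(n_outer):
        cost = const - 2.0 * c1 @ plan @ c2.T
        plan = sinkhorn(cost, p, q, eps)
    discrepancy = np.sum((const - 2.0 * c1 @ plan @ c2.T) * plan)
    return plan, discrepancy

# Hypothetical data: 6 input components and 4 subcapsules in 8 dimensions.
rng = np.random.default_rng(0)
input_parts = rng.normal(size=(6, 8))
subcapsules = rng.normal(size=(4, 8))

# Intra-set structure matrices, rescaled to [0, 1] for numerical stability.
c1 = pairwise_sq_dists(input_parts)
c1 /= c1.max()
c2 = pairwise_sq_dists(subcapsules)
c2 /= c2.max()

p = np.full(6, 1.0 / 6)  # assumed uniform weights over input components
q = np.full(4, 1.0 / 4)  # assumed uniform weights over subcapsules

plan, discrepancy = entropic_gw(c1, c2, p, q)
print(plan.shape, float(discrepancy))
```

Under these assumptions, `plan[i, j]` can be read as a soft alignment between input component `i` and subcapsule `j`, and `discrepancy` as a scalar measure of how differently the two sets are structured. Gromov-Wasserstein compares only the pairwise distances within each set, so the two sets do not need to live in the same space.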