We study visual representation learning from a structural and topological perspective. We begin from a single hypothesis: that visual understanding presupposes a semantic language for vision, in which many perceptual observations correspond to a small number of discrete semantic states. Together with widely assumed premises on transferability and abstraction in representation learning, this hypothesis implies that the visual observation space must be organized in a fiber-bundle-like structure, where nuisance variation populates fibers and semantics correspond to a quotient base space. From this structure we derive two theoretical consequences. First, the semantic quotient X/G is not a submanifold of X and cannot be obtained through smooth deformation alone; semantic invariance requires a non-homeomorphic, discriminative target (for example, supervision via labels, cross-instance identification, or multimodal alignment) that supplies explicit semantic equivalence. Second, we show that approximating the quotient also places structural demands on the model architecture. Semantic abstraction requires not only an external semantic target, but also a representation mechanism capable of supporting topology change: an expand-and-snap process in which the manifold is first geometrically expanded to separate structure and then collapsed to form discrete semantic regions. We emphasize that these results are interpretive rather than prescriptive: the framework provides a topological lens that aligns with empirical regularities observed in large-scale discriminative and multimodal models, and with classical principles in statistical learning theory.
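The quotient and expand-and-snap picture above can be made concrete with a toy instance (our own illustrative construction, not the paper's): take X to be two concentric circles in the plane, with the nuisance group G = SO(2) acting by rotation. Each circle is a single G-orbit, so X/G has exactly two points, and no homeomorphism of the plane can collapse a whole circle to a point. The sketch below "expands" by computing the G-invariant radial coordinate and then "snaps" by thresholding it into discrete semantic states; the radii, sample counts, and threshold are assumptions chosen for illustration.

```python
import numpy as np

# Toy observation space X: 200 points on two concentric circles in R^2.
# The nuisance group G = SO(2) acts by rotation, so each circle is one
# G-orbit and the semantic quotient X/G consists of two discrete states.
rng = np.random.default_rng(0)
angles = rng.uniform(0.0, 2.0 * np.pi, size=200)
radii = np.where(np.arange(200) < 100, 1.0, 2.0)  # orbit 0: r=1, orbit 1: r=2
X = np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1)

# "Expand": lift to a coordinate in which the orbits become separable.
# The radial coordinate is a G-invariant function on X.
r = np.linalg.norm(X, axis=1)

# "Snap": a non-injective collapse onto the two points of X/G.
# Every point of an orbit maps to the same discrete state -- a fiber
# collapse that no smooth deformation of X alone could achieve.
labels = (r > 1.5).astype(int)
```

After the snap, all 100 inner-circle points share state 0 and all 100 outer-circle points share state 1, even though within each orbit the observed coordinates vary arbitrarily under rotation.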