Vector-quantized representations enable powerful discrete generative models but lack semantic structure in token space, limiting interpretable human control. We introduce SOM-VQ, a tokenization method that combines vector quantization with Self-Organizing Maps to learn discrete codebooks with explicit low-dimensional topology. Unlike standard VQ-VAE, SOM-VQ uses topology-aware updates that preserve neighborhood structure: nearby tokens on a learned grid correspond to semantically similar states, enabling direct geometric manipulation of the latent space. We demonstrate that SOM-VQ produces more learnable token sequences in the evaluated domains while providing an explicit, navigable geometry in code space. Critically, the topological organization enables intuitive human-in-the-loop control: users can steer generation by manipulating distances in token space, achieving semantic alignment without frame-level constraints. We focus on human motion generation, a domain where kinematic structure, smooth temporal continuity, and interactive use cases (choreography, rehabilitation, HCI) make topology-aware control especially natural, and demonstrate controlled divergence from and convergence toward reference sequences through simple grid-based sampling. SOM-VQ provides a general framework for interpretable discrete representations applicable to music, gesture, and other interactive generative domains.
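To make the topology-aware update concrete, the following is a minimal sketch of an SOM-style codebook update on a 2-D grid: the best-matching code is selected as in standard vector quantization, but its grid neighbors also move toward the encoder output, with Gaussian falloff in grid distance. All names, the kernel choice, and the learning-rate scheme here are illustrative assumptions, not the paper's exact training rule.

```python
import numpy as np

def som_vq_update(codebook, grid_coords, z, lr=0.1, sigma=1.0):
    """One SOM-style update of the codebook toward encoder output z.

    codebook:    (K*K, d) array of code vectors (modified in place)
    grid_coords: (K*K, 2) grid positions of each code on the K x K map
    z:           (d,) encoder output
    Returns the index of the best-matching unit (the emitted token).
    """
    # 1. Vector quantization: pick the best-matching unit (BMU) in feature space.
    dists = np.linalg.norm(codebook - z, axis=1)
    bmu = int(np.argmin(dists))

    # 2. Topology-aware update: codes near the BMU on the grid also move
    #    toward z, weighted by a Gaussian in grid distance. This is what
    #    makes nearby tokens on the grid converge to similar states.
    grid_dist = np.linalg.norm(grid_coords - grid_coords[bmu], axis=1)
    h = np.exp(-grid_dist**2 / (2 * sigma**2))      # neighborhood weights
    codebook += lr * h[:, None] * (z - codebook)    # pull codes toward z
    return bmu

# Toy usage: a 4x4 grid of 8-dimensional codes.
K, d = 4, 8
rng = np.random.default_rng(0)
codebook = rng.normal(size=(K * K, d))
grid_coords = np.array([(i, j) for i in range(K) for j in range(K)], dtype=float)
token = som_vq_update(codebook, grid_coords, rng.normal(size=d))
```

Because neighbors share updates, grid distance between tokens becomes a meaningful control signal: steering generation toward or away from a reference sequence reduces to sampling tokens at a chosen grid distance from the reference's tokens.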