PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

Many operations on sensory data -- comparison, memory, retrieval, and reasoning -- are naturally expressed over discrete symbolic structures. In language this interface is given by tokens; in audio, it must be learned. Existing audio tokenizers rely on quantization, clustering, or codec reconstruction and assign tokens locally, so sequence consistency, compactness, length control, termination, and edit similarity are rarely optimized directly. We introduce PairAlign, a framework for compact audio tokenization through sequence-level self-alignment. PairAlign treats tokenization as conditional sequence generation: an encoder maps speech to a continuous condition, and an autoregressive decoder generates tokens from BOS, learning token identity, order, length, and EOS placement. Given two content-preserving views, each view's sequence is trained to be likely under the other's representation, while unrelated examples provide competing sequences. This gives a scalable surrogate for edit-distance preservation while discouraging many-to-one collapse. PairAlign starts from VQ-style tokenization and refines it with EMA-teacher targets, cross-paired teacher forcing, prefix corruption, likelihood contrast, and length control. On 3-second speech, PairAlign learns compact, non-degenerate sequences with broad vocabulary usage and strong cross-view consistency. On retrieval tests, it preserves edit-distance search while reducing archive token count by 55%. A continuous-sweep probe shows lower local overlap than a dense geometric tokenizer, but stronger length control and bounded edit trajectories under 100 ms shifts. PairAlign is a sequence-symbolic predictive learner: like JEPA-style objectives, it predicts an abstract target from another view, but that target is a learned variable-length symbolic sequence rather than a continuous latent.
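To make the retrieval setting concrete, the following is a minimal sketch of edit-distance search over token sequences. It is an illustration under stated assumptions, not the PairAlign implementation: the function names `edit_distance` and `retrieve`, the toy token IDs, and the use of plain Levenshtein distance with exhaustive search are all hypothetical choices made here.

```python
from typing import List, Sequence, Tuple


def edit_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Levenshtein distance between two token-ID sequences (unit costs)."""
    # prev[j] holds the distance between the first i-1 tokens of a
    # and the first j tokens of b.
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]  # deleting the first i tokens of a to reach an empty prefix of b
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def retrieve(query: Sequence[int],
             archive: List[Sequence[int]],
             k: int = 1) -> List[Tuple[int, int]]:
    """Return (archive_index, distance) for the k closest archive entries."""
    scored = sorted((edit_distance(query, seq), idx)
                    for idx, seq in enumerate(archive))
    return [(idx, dist) for dist, idx in scored[:k]]


# Hypothetical token sequences; real sequences would come from a tokenizer.
archive = [[5, 9, 9, 2, 7], [5, 9, 2, 7], [1, 1, 4, 8]]
query = [5, 9, 3, 7]
print(retrieve(query, archive, k=2))  # [(1, 1), (0, 2)]
```

Each comparison costs time proportional to the product of the two sequence lengths, so shorter archive sequences lower both storage and per-query search cost, which is why compactness matters in this setting.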
