In speech language modeling, two architectures dominate the frontier: the Transformer and the Conformer. However, it remains unknown whether their comparable performance stems from convergent processing strategies or from distinct architectural inductive biases. We introduce Architectural Fingerprinting, a probing framework that isolates the effect of architecture on representation, and apply it to a controlled suite of 24 pre-trained encoders (39M-3.3B parameters). Our analysis reveals divergent hierarchies: Conformers implement a "Categorize Early" strategy, resolving phoneme categories 29% earlier in relative depth and speaker gender 16% earlier. In contrast, Transformers "Integrate Late," deferring phoneme, accent, and duration encoding to deep layers (49-57% of depth). These fingerprints suggest design heuristics: Conformers' front-loaded categorization may benefit low-latency streaming, while Transformers' deep integration may favor tasks requiring rich context and cross-utterance normalization.
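For intuition, the sketch below shows one plausible way a relative-depth fingerprint could be computed from layer-wise probe accuracies. The function name, the 90%-of-peak threshold, and the toy accuracy curves are illustrative assumptions, not the paper's actual metric.

```python
import numpy as np

def relative_resolution_depth(layer_accuracies, frac=0.9):
    """Hypothetical metric: relative depth at which a property is 'resolved'.

    layer_accuracies: probe accuracy per encoder layer (layer 0 .. L-1).
    Returns the index of the first layer whose probe accuracy reaches
    `frac` of the best layer's accuracy, normalized to [0, 1] by depth.
    """
    acc = np.asarray(layer_accuracies, dtype=float)
    threshold = frac * acc.max()
    first = int(np.argmax(acc >= threshold))  # first layer meeting the threshold
    return first / (len(acc) - 1)

# Toy curves: a "Categorize Early" profile resolves the property sooner in depth
# than an "Integrate Late" profile (values are illustrative only).
categorize_early = [0.52, 0.70, 0.88, 0.91, 0.92, 0.92, 0.91, 0.90]
integrate_late   = [0.50, 0.55, 0.62, 0.70, 0.80, 0.88, 0.92, 0.93]
print(relative_resolution_depth(categorize_early))  # ~0.29
print(relative_resolution_depth(integrate_late))    # ~0.71
```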