Mechanisms of Multimodal Synchronization: Insights from Decoder-Based Video-Text-to-Speech Synthesis

Unified decoder-only transformers have shown promise for multimodal generation, yet the mechanisms by which they synchronize modalities with heterogeneous sampling rates remain underexplored. We investigate these mechanisms through video-text-to-speech (VTTS) synthesis, a controlled task requiring fine-grained temporal alignment between sparse text, video, and continuous speech. Using a unified decoder-only transformer, dubbed Visatronic, trained on VoxCeleb2, we study: (i) how modalities contribute complementary information, (ii) how positional encoding strategies enable synchronization across heterogeneous rates, (iii) how modality ordering shapes the trade-off between in-domain performance and cross-domain transfer, and (iv) how phoneme-level synchronization metrics provide diagnostic insight into per-phoneme timing errors. We find that both "global sequential indexing" (unique position IDs across modalities) and "co-temporal ordered indexing" (identical IDs for temporally corresponding tokens) achieve strong synchronization performance, with co-temporal ordered indexing providing a simple mechanism that needs no explicit timestamp metadata. Text and video contribute complementary signals: text ensures intelligibility, while video provides temporal cues and emotional expressiveness. Modality ordering exposes a consistent trade-off: video-first ordering achieves stronger in-domain performance, while text-first ordering generalizes more robustly to unseen domains. We also find that diverse large-scale training enables transferable synchronization strategies. To enable fine-grained analysis, we introduce TimeSync, a phoneme-level metric that reveals temporal misalignments overlooked by frame-level metrics. These insights establish VTTS as a valuable testbed for understanding temporal synchronization in unified multimodal decoders.
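
To make the two indexing schemes concrete, the sketch below assigns position IDs to a toy text, video, and speech sequence. It is an illustration only, not the Visatronic implementation: the function names, the text-then-video-then-speech layout, the integer token rates, and the rule that maps each speech token to the video frame covering its start time are all assumptions introduced here.

```python
from typing import Dict, List


def global_sequential_ids(n_text: int, n_video: int, n_speech: int) -> Dict[str, List[int]]:
    """Give every token a unique position ID by laying the modalities end to end.

    Assumed layout (hypothetical): text, then video, then speech.
    """
    text = list(range(n_text))
    video = list(range(n_text, n_text + n_video))
    speech = list(range(n_text + n_video, n_text + n_video + n_speech))
    return {"text": text, "video": video, "speech": speech}


def co_temporal_ids(
    n_text: int,
    n_video: int,
    n_speech: int,
    video_rate: int,
    speech_rate: int,
) -> Dict[str, List[int]]:
    """Give temporally corresponding video and speech tokens the same position ID.

    Assumptions (not from the source): text has no timing and keeps its own
    sequential IDs; rates are integer tokens per second; a speech token shares
    the ID of the video frame that covers its start time.
    """
    if n_video < 1:
        raise ValueError("need at least one video frame to align speech against")
    text = list(range(n_text))
    offset = n_text  # video IDs start right after the text IDs
    video = [offset + i for i in range(n_video)]
    speech = []
    for j in range(n_speech):
        # Speech token j starts at j / speech_rate seconds, which falls in
        # video frame floor(j * video_rate / speech_rate).
        frame = (j * video_rate) // speech_rate
        frame = min(frame, n_video - 1)  # clamp speech that runs past the video
        speech.append(offset + frame)
    return {"text": text, "video": video, "speech": speech}


if __name__ == "__main__":
    print(global_sequential_ids(2, 2, 4))
    print(co_temporal_ids(2, 2, 4, video_rate=25, speech_rate=50))
```

With two text tokens, two video frames, and four speech tokens at twice the video rate, the first scheme yields IDs 0 through 7 with no repeats, while the second reuses each video frame's ID for the two speech tokens that overlap it. The alignment in the second scheme is derived from token counts and rates alone, which is one way to read the claim that no explicit timestamp metadata is required.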
