Holistic visual tokenizers are fundamental to unified multimodal models (UMMs) because they map diverse visual inputs into a unified representation space. In this paper, we present HYDRA-X, the first UMM that unifies image and video tokenization within a single Vision Transformer (ViT). Our design is driven by two core challenges: efficiently injecting spatiotemporal reconstruction capability into a native ViT, and embedding image- and video-level semantic awareness into the latent space. To address the first, we conduct comprehensive ablations, which reveal two key findings: (1) frame-level causal temporal attention suffices for visual reconstruction, whereas full spatiotemporal attention degrades it; and (2) hierarchical temporal compression substantially outperforms single-step alternatives. To tackle the second, we propose a lightweight decompressor that upsamples temporally compressed features under joint image-video teacher supervision, thereby enforcing complementary semantic structures within the compact latent space. Building on this holistic tokenizer, we further propose a principled improvement to the editing pipeline: source-target interaction should occur at the latent level inside the tokenizer rather than at the semantic level inside the LLM, which substantially improves editing consistency and accelerates convergence. Instantiated as a 7B dense model, HYDRA-X achieves strong performance across image and video understanding and generation tasks, paving the way for future unified-tokenizer UMMs.
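To make the distinction in finding (1) concrete, the following is a minimal sketch of what a frame-level causal attention mask could look like. It is an illustration under stated assumptions, not the HYDRA-X implementation: the abstract does not specify the mask, so we assume tokens are laid out frame by frame, that tokens attend bidirectionally within their own frame, and that they see all earlier frames but no later ones. The function names `frame_causal_mask` and `full_spatiotemporal_mask` are hypothetical.

```python
from typing import List


def frame_causal_mask(num_frames: int, tokens_per_frame: int) -> List[List[bool]]:
    """Return mask[q][k] = True where query token q may attend to key token k.

    Assumption (not taken from the paper): tokens are ordered frame by frame,
    so token i belongs to frame i // tokens_per_frame. A query sees every
    token in its own frame and in all earlier frames, and nothing later.
    """
    total = num_frames * tokens_per_frame
    frame_of = [i // tokens_per_frame for i in range(total)]
    return [[frame_of[k] <= frame_of[q] for k in range(total)] for q in range(total)]


def full_spatiotemporal_mask(num_frames: int, tokens_per_frame: int) -> List[List[bool]]:
    """Baseline for comparison: every token attends to every other token."""
    total = num_frames * tokens_per_frame
    return [[True] * total for _ in range(total)]


if __name__ == "__main__":
    # 3 frames with 2 tokens each gives a 6 x 6 block lower-triangular mask.
    for row in frame_causal_mask(num_frames=3, tokens_per_frame=2):
        print("".join("1" if allowed else "0" for allowed in row))
    # 110000
    # 110000
    # 111100
    # 111100
    # 111111
    # 111111
```

With a single frame the mask is all ones, so under this assumed layout an image is processed with ordinary bidirectional attention while a video adds only the causal constraint across frames.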