In this report, we present HyperCLOVA X 8B Omni, the first any-to-any omnimodal model in the HyperCLOVA X family that supports text, audio, and vision as both inputs and outputs. By consolidating multimodal understanding and generation into a single model rather than separate modality-specific pipelines, HyperCLOVA X 8B Omni serves as an 8B-scale pathfinder toward practical any-to-any omni assistants. At a high level, the model unifies modalities through a shared next-token prediction interface over an interleaved multimodal sequence, while vision and audio encoders inject continuous embeddings for fine-grained understanding and grounding. Empirical evaluations demonstrate competitive performance against comparably sized models across diverse input-output combinations spanning text, audio, and vision, in both Korean and English. We anticipate that the open-weight release of HyperCLOVA X 8B Omni will support a wide range of research and deployment scenarios.
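To make the interleaving idea concrete, the following is a minimal sketch, not the actual HyperCLOVA X 8B Omni implementation: discrete text tokens are embedded via a lookup table, a stand-in vision encoder supplies continuous embeddings, and both are concatenated into one sequence that a decoder-only LM would consume under next-token prediction. All names, shapes, and sizes here are illustrative assumptions.

```python
import numpy as np

D = 16                                   # hidden size (illustrative)
VOCAB = 100                              # toy vocabulary
rng = np.random.default_rng(0)
tok_emb = rng.normal(size=(VOCAB, D))    # token embedding table

def embed_text(token_ids):
    """Discrete text tokens -> rows of the embedding table."""
    return tok_emb[np.asarray(token_ids)]

def encode_image(patches):
    """Stand-in for a vision encoder: project patch features to the hidden size."""
    W = rng.normal(size=(patches.shape[-1], D)) / np.sqrt(patches.shape[-1])
    return patches @ W

def interleave(segments):
    """Concatenate per-modality embedding chunks into one interleaved sequence."""
    return np.concatenate(segments, axis=0)

text_a = embed_text([1, 2, 3])                  # e.g. an instruction span
image = encode_image(rng.normal(size=(4, 8)))   # 4 patches, 8-dim features
text_b = embed_text([4, 5])                     # e.g. an answer span

seq = interleave([text_a, image, text_b])
print(seq.shape)  # (9, 16): 3 text + 4 image + 2 text positions
```

In this view, generation in any modality reduces to predicting the next element of the shared sequence, while the injected continuous embeddings carry fine-grained perceptual detail that discrete tokens alone would lose.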