MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model

MiniMind-O is an open 0.1B-scale omni model built on the MiniMind language model. It accepts text, speech, and image inputs, and returns both text and streaming speech. The release includes model code, checkpoints, and the main Parquet training datasets for text-to-audio, image-to-text, and audio-to-audio training, making the complete interaction loop directly inspectable. The model uses a full MiniMind backbone as the Thinker and an independent four-layer Talker made from MiniMind blocks. Frozen SenseVoice-Small and SigLIP2 encoders provide speech and image features, which are mapped by lightweight MLP projectors and injected at modality-placeholder positions. The Talker reads a middle-layer Thinker state together with an autoregressive eight-layer Mimi-code buffer. Speaker control is handled by a dedicated speaker token, right-aligned reference codec prompts, and precomputed CAM++ speaker embeddings, so voice conditioning remains part of the audio-code context rather than a separate TTS module. With a 768-dimensional Talker, the dense and MoE variants reach average CERs of 0.0897 and 0.0900 in Thinker-Talker consistency evaluation, with overall voice-cloning similarities of 0.5995 and 0.5937. Beyond reporting a working system, the paper identifies three scale-critical design choices for small omni models: middle-layer semantic bridging, a released multimodal sequence format, and a parameter-efficient eight-codebook interface.
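
The abstract reports character error rates (CERs) without defining the metric. As a reading aid, the sketch below computes CER in its standard form: the character-level Levenshtein distance between a reference string and a hypothesis string, divided by the reference length. It is an illustration only, not the report's evaluation code. The function names `edit_distance` and `cer` are hypothetical, and how the report transcribes the Talker's speech or normalizes text before scoring is not stated here, so neither step is modeled.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between ref and hyp."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference characters into an empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Assumption: the reference is the text the Thinker produced and the
    # hypothesis is a transcript of the Talker's speech. Both strings
    # below are made up for illustration.
    print(cer("the cat sat", "the cat sat"))  # identical strings give 0.0
    print(cer("abcd", "abxd"))                # one substitution in four characters
```

Under this definition, a CER of about 0.09 corresponds to roughly nine character edits per hundred reference characters.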

