MiniMind-O is an open 0.1B-scale omni model built on the MiniMind language model. It accepts text, speech, and image inputs, and returns both text and streaming speech. The release includes model code, checkpoints, and the main Parquet training datasets for text-to-audio, image-to-text, and audio-to-audio training, making the complete interaction loop directly inspectable. The model uses a full MiniMind backbone as the Thinker and an independent four-layer Talker made from MiniMind blocks. Frozen SenseVoice-Small and SigLIP2 encoders provide speech and image features, which are mapped by lightweight MLP projectors and injected at modality-placeholder positions. The Talker reads a middle-layer Thinker state together with an autoregressive eight-layer Mimi-code buffer. Speaker control is handled by a dedicated speaker token, right-aligned reference codec prompts, and precomputed CAM++ speaker embeddings, so voice conditioning remains part of the audio-code context rather than a separate TTS module. With a 768-dimensional Talker, the dense and MoE variants reach average CERs of 0.0897 and 0.0900 in Thinker-Talker consistency evaluation, with overall voice-cloning similarities of 0.5995 and 0.5937. Beyond reporting a working system, the paper identifies three scale-critical design choices for small omni models: middle-layer semantic bridging, a released multimodal sequence format, and a parameter-efficient eight-codebook interface.
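To make the placeholder-injection step concrete, the following is a minimal sketch in plain Python, not the released implementation. It assumes a two-layer MLP with a ReLU activation as the projector, and every name in it (`PLACEHOLDER_ID`, `make_linear`, `project`, `inject`) as well as the toy dimensions are hypothetical choices made for illustration. It shows frozen-encoder feature vectors being mapped into the backbone's hidden size and written over the embeddings at placeholder positions.

```python
import random

PLACEHOLDER_ID = -1  # hypothetical token id marking a modality-placeholder position


def make_linear(in_dim, out_dim, rng):
    """Create a small random (weight, bias) pair; a stand-in for trained parameters."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def linear(x, weight, bias):
    """Compute y = W x + b for a single feature vector."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    """Element-wise ReLU (the activation is an assumption, not taken from the paper)."""
    return [max(0.0, v) for v in x]


def project(features, layer1, layer2):
    """Assumed two-layer MLP projector: encoder feature space -> backbone hidden space."""
    return [linear(relu(linear(f, *layer1)), *layer2) for f in features]


def inject(token_ids, token_embeddings, projected):
    """Overwrite embeddings at placeholder positions with projected features, in order."""
    positions = [i for i, t in enumerate(token_ids) if t == PLACEHOLDER_ID]
    if len(positions) != len(projected):
        raise ValueError("placeholder count must match the number of feature vectors")
    out = [list(e) for e in token_embeddings]  # copy so the input stays untouched
    for pos, feat in zip(positions, projected):
        out[pos] = list(feat)
    return out


# Toy sizes chosen only for the demonstration.
rng = random.Random(0)
enc_dim, mid_dim, hidden_dim = 6, 8, 4
layer1 = make_linear(enc_dim, mid_dim, rng)
layer2 = make_linear(mid_dim, hidden_dim, rng)

# A sequence with two placeholder slots between two ordinary text tokens.
token_ids = [11, PLACEHOLDER_ID, PLACEHOLDER_ID, 12]
token_embeddings = [[float(t)] * hidden_dim for t in token_ids]

# Stand-in for the output of a frozen speech or image encoder (two frames/patches).
encoder_features = [[rng.uniform(-1, 1) for _ in range(enc_dim)] for _ in range(2)]

projected = project(encoder_features, layer1, layer2)
mixed = inject(token_ids, token_embeddings, projected)
```

In this sketch the text-token embeddings pass through unchanged and only the placeholder slots carry modality features, so the backbone sees a single mixed sequence. Only the projector would hold trainable parameters under this reading, since the encoders are described as frozen.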