We present Eureka-Audio, a compact yet high-performance audio language model that achieves competitive performance against models 4 to 18 times larger across a broad range of audio understanding benchmarks. Despite containing only 1.7B parameters, Eureka-Audio demonstrates strong performance on automatic speech recognition (ASR), audio understanding, and dense audio captioning, matching or surpassing multiple 7B to 30B audio and omni-modal baselines. The model adopts a unified end-to-end architecture composed of a lightweight language backbone, a Whisper-based audio encoder, and a sparsely activated Mixture-of-Experts (MoE) adapter that explicitly accounts for audio heterogeneity and alleviates cross-modal optimization conflicts under limited capacity. To further enhance paralinguistic reasoning, we introduce DataFlux, a closed-loop audio instruction data synthesis and verification pipeline that constructs high-quality, logically consistent supervision from raw audio. Extensive evaluations across ASR, knowledge reasoning, safety, instruction following, and paralinguistic benchmarks demonstrate that Eureka-Audio achieves an efficient balance between computational cost and performance. These results establish Eureka-Audio as a strong and practical baseline for lightweight audio understanding models.