We introduce a voice-agentic framework that learns one critical omni-understanding skill: knowing when to trust itself and when to consult external audio perception. Our work is motivated by a crucial yet counterintuitive finding: naively fine-tuning an omni-model jointly on speech recognition and external sound-understanding tasks often degrades performance, because the model is easily misled by noisy hypotheses. To address this, our framework, Speech-Hands, recasts the problem as an explicit self-reflection decision. This learnable reflection primitive proves effective at preventing the model from being derailed by flawed external candidates. We show that this agentic action mechanism generalizes naturally from speech recognition to complex, multiple-choice audio reasoning. On the OpenASR leaderboard, Speech-Hands consistently outperforms strong baselines, reducing WER by 12.1% on average across seven benchmarks. The model also achieves 77.37% accuracy and a high F1 score on audio QA decisions, demonstrating robust generalization and reliability across diverse audio question answering datasets. By unifying perception and decision-making, our work offers a practical path toward more reliable and resilient audio intelligence.
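The self-reflection decision described above can be sketched as a simple gating step: the model either keeps its own transcript or consults an external hypothesis. This is a minimal illustrative sketch, not the paper's actual implementation; the function and type names (`reflect_or_accept`, `Decision`) and the confidence-threshold rule are hypothetical stand-ins for the learned reflection primitive.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    """Outcome of the reflection step (hypothetical structure)."""
    use_external: bool
    transcript: str


def reflect_or_accept(own_transcript: str,
                      own_confidence: float,
                      external_hypothesis: str,
                      threshold: float = 0.5) -> Decision:
    """Explicit self-reflection as a gate: trust the model's own output
    when its confidence is high; otherwise consult the external audio
    perception module. In Speech-Hands this decision is learned, not a
    fixed threshold as in this sketch."""
    if own_confidence >= threshold:
        return Decision(use_external=False, transcript=own_transcript)
    return Decision(use_external=True, transcript=external_hypothesis)


# Confident model: a noisy external candidate cannot derail it.
d = reflect_or_accept("hello world", own_confidence=0.9,
                      external_hypothesis="hollow world")
```

In this sketch, a flawed external candidate only enters the output when the model's own confidence is low, which mirrors the abstract's claim that the reflection primitive prevents derailment by noisy hypotheses.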