We introduce a voice-agentic framework that learns one critical omni-understanding skill: knowing when to trust itself and when to consult external audio perception. Our work is motivated by a crucial yet counterintuitive finding: naively fine-tuning an omni-model on both speech recognition and external sound understanding tasks often degrades performance, because the model is easily misled by noisy hypotheses. To address this, our framework, Speech-Hands, recasts the problem as an explicit self-reflection decision. This learnable reflection primitive proves effective in preventing the model from being derailed by flawed external candidates. We show that this agentic action mechanism generalizes naturally from speech recognition to complex, multiple-choice audio reasoning. On seven benchmarks from the OpenASR leaderboard, Speech-Hands consistently outperforms strong baselines by 12.1% WER. The model also achieves 77.37% accuracy and high F1 on audio QA decisions, showing robust generalization and reliability across diverse audio question answering datasets. By unifying perception and decision-making, our work offers a practical path toward more reliable and resilient audio intelligence.
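To make the trust-or-consult decision concrete, the following is a minimal Python sketch of how such a choice could be framed for speech recognition. It is an illustration under stated assumptions, not the Speech-Hands implementation: the names `Action`, `oracle_action`, and `select` are hypothetical, and labeling each utterance with whichever hypothesis has the lower word error rate is only one plausible way to obtain supervision for a learned decision.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical action labels; the abstract does not name them."""
    TRUST_SELF = "self"
    USE_EXTERNAL = "external"


def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, 1):
        cur = [i]
        for j, h in enumerate(hyp_words, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate of one hypothesis against one reference string."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)


def oracle_action(reference, own_hyp, external_hyp):
    """Assumed labeling rule: prefer the lower-WER hypothesis, ties go to self."""
    if wer(reference, own_hyp) <= wer(reference, external_hyp):
        return Action.TRUST_SELF
    return Action.USE_EXTERNAL


def select(action, own_hyp, external_hyp):
    """Return the transcript implied by the chosen action."""
    return own_hyp if action is Action.TRUST_SELF else external_hyp
```

At inference time no reference transcript is available, so a real system would have to predict the action from the audio and the candidates; the oracle above only shows what a correct decision looks like when one of the two hypotheses is flawed.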