In this work, we introduce Speech-Copilot, a modular framework for instruction-oriented speech-processing tasks that minimizes human effort in toolset construction. Unlike end-to-end methods using large audio-language models, Speech-Copilot builds speech processing-specific toolsets by analyzing pre-collected task instructions and breaking tasks into manageable sub-tasks. It features a flexible agent based on large language models that performs tasks through program generation. Our approach achieves state-of-the-art performance on the Dynamic-SUPERB benchmark, demonstrating its effectiveness across diverse speech-processing tasks. Key contributions include: 1) developing an innovative framework for speech processing-specific toolset construction, 2) establishing a high-performing agent based on large language models, and 3) offering a new perspective on addressing challenging instruction-oriented speech-processing tasks. Without additional training processes required by end-to-end approaches, our method provides a flexible and extendable solution for a wide range of speech-processing applications.
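The core loop described above — an LLM-based agent that answers a speech instruction by generating a small program over a pre-built toolset — can be sketched roughly as follows. This is a minimal illustration, not the paper's actual implementation: the tool names (`transcribe`, `classify_emotion`) and the rule-based stand-in for LLM program generation are hypothetical.

```python
def transcribe(audio):
    # Hypothetical speech-to-text tool; a real system would invoke an ASR model.
    return audio["text"]

def classify_emotion(audio):
    # Hypothetical emotion-recognition tool over the same audio clip.
    return audio["emotion"]

# Speech processing-specific toolset, assembled ahead of time from
# sub-tasks identified in pre-collected task instructions.
TOOLSET = {"transcribe": transcribe, "classify_emotion": classify_emotion}

def generate_program(instruction):
    # Stand-in for the LLM: given an instruction, emit a short program
    # that composes tools from the toolset. A real agent would prompt
    # a large language model with the toolset's documentation.
    if "emotion" in instruction.lower():
        return "result = classify_emotion(audio)"
    return "result = transcribe(audio)"

def run_agent(instruction, audio):
    # The agent executes the generated program in a scope that exposes
    # the toolset and the input audio, then reads back the result.
    program = generate_program(instruction)
    scope = {**TOOLSET, "audio": audio}
    exec(program, scope)
    return scope["result"]

# Toy "audio" object standing in for a real waveform plus model outputs.
audio = {"text": "hello world", "emotion": "happy"}
print(run_agent("What emotion does the speaker convey?", audio))  # happy
print(run_agent("Transcribe this clip.", audio))  # hello world
```

Because the agent only generates and runs programs over a fixed toolset, extending it to a new task means adding a tool, not retraining a model — the flexibility the abstract contrasts with end-to-end approaches.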