With recent advances in reasoning capabilities, tool calling via MCP servers, and Audio Language Models (ALMs), the development and integration of multi-modal agents (with voice and text support) has moved to the industry forefront. Cascading pipelines for voice agents still play a central role in industry owing to the superior reasoning capabilities provided by LLMs. However, cascading pipelines often suffer from error propagation through the pipeline. We propose FOCAL, a framework for benchmarking end-to-end reasoning, component-wise error propagation, and error analysis in both automated and human-assisted testing of multi-modal agents (voice-to-voice with text input). We also introduce two novel metrics, viz. Reasoning and Semantic scores, to evaluate an agent's efficacy in holding meaningful conversations in voice mode.