Deploying Audio-Language Models (Audio-LLMs) on edge infrastructure exposes a persistent tension between perception depth and computational efficiency. Lightweight local models tend to produce passive perception: generic summaries that miss the subtle evidence required for multi-step audio reasoning. Indiscriminate cloud offloading, by contrast, incurs unacceptable latency, bandwidth cost, and privacy risk. We propose CoFi-Agent (Tool-Augmented Coarse-to-Fine Agent), a hybrid architecture targeting edge servers and gateways that performs fast local perception and triggers conditional forensic refinement only when uncertainty is detected. CoFi-Agent first runs a single-pass inference on a local 7B Audio-LLM; a cloud controller then gates difficult cases and issues lightweight plans for on-device tools such as temporal re-listening and local ASR. On the MMAR benchmark, CoFi-Agent improves accuracy from 27.20% to 53.60% while achieving a better accuracy-efficiency trade-off than an always-on investigation pipeline. Overall, CoFi-Agent bridges the perception gap through tool-enabled, conditional edge-cloud collaboration under practical system constraints.
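To make the control flow concrete, the following is a minimal Python sketch of the conditional coarse-to-fine gating described above. It is illustrative only: `local_llm`, `cloud_controller`, the tool names, and the confidence threshold `tau` are hypothetical stand-ins under our reading of the architecture, not CoFi-Agent's actual API.

```python
# Hypothetical sketch of CoFi-Agent's conditional coarse-to-fine loop.
# All model/tool calls are injected stand-ins; the real system uses a
# local 7B Audio-LLM, a cloud controller, and on-device tools such as
# temporal re-listening and local ASR.

from dataclasses import dataclass
from typing import Callable


@dataclass
class CoarseResult:
    answer: str
    confidence: float  # e.g., calibrated from mean token log-probability


def cofi_agent(
    audio: bytes,
    question: str,
    local_llm: Callable[[bytes, str], CoarseResult],
    cloud_controller: Callable[[str, str], list[tuple[str, dict]]],
    tools: dict[str, Callable[..., str]],
    tau: float = 0.6,  # assumed uncertainty-gating threshold
) -> str:
    # Stage 1: fast single-pass perception on the local model.
    coarse = local_llm(audio, question)
    if coarse.confidence >= tau:
        return coarse.answer  # confident cases never leave the edge

    # Stage 2: only the coarse text (not raw audio) goes to the cloud
    # controller, which returns a lightweight tool plan, e.g.
    # [("relisten", {"start": 3.0, "end": 7.5}), ("asr", {})].
    plan = cloud_controller(coarse.answer, question)

    # Stage 3: execute the plan with on-device tools, then re-query the
    # local model conditioned on the gathered evidence.
    evidence = "\n".join(tools[name](audio, **args) for name, args in plan)
    refined = local_llm(audio, f"{question}\nEvidence:\n{evidence}")
    return refined.answer
```

The key property this sketch highlights is that confident cases are resolved entirely on the edge, and even escalated cases send only a text summary to the cloud, which is how the conditional design bounds latency, bandwidth, and privacy exposure.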