Long-duration audio is increasingly common in industrial and consumer settings, yet reviewing multi-hour recordings is impractical, motivating systems that answer natural-language queries with precise temporal grounding and minimal hallucination. Existing audio-language models show promise, but long-audio question answering remains difficult due to context-length limits. We introduce LongAudio-RAG (LA-RAG), a hybrid framework that grounds Large Language Model (LLM) outputs in retrieved, timestamped acoustic event detections rather than raw audio. Multi-hour streams are converted into structured event records stored in an SQL database; at inference time, the system resolves natural-language time references, classifies query intent, retrieves only the relevant events, and generates answers from this constrained evidence. To evaluate performance, we construct a synthetic long-audio benchmark by concatenating recordings with preserved timestamps and generating template-based question-answer pairs for detection, counting, and summarization tasks. Finally, we demonstrate the practicality of our approach by deploying it in a hybrid edge-cloud environment, where the audio grounding model runs on-device on IoT-class hardware while the LLM is hosted on a GPU-backed server. This architecture enables low-latency event extraction at the edge and high-quality language reasoning in the cloud. Experiments show that structured, event-level retrieval significantly improves accuracy over vanilla Retrieval-Augmented Generation (RAG) and text-to-SQL baselines.
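The pipeline described above can be illustrated with a minimal sketch. The schema, function names, and sample events below are hypothetical assumptions for illustration, not the paper's actual implementation: acoustic detections are stored as timestamped rows in SQLite, a query's resolved time window and intent constrain a SQL lookup, and only the retrieved rows would be passed to the LLM as evidence.

```python
import sqlite3

def build_event_db(events):
    """Store timestamped acoustic event detections as structured records.

    Each event is a (label, start_s, end_s, confidence) tuple, where the
    times are offsets in seconds from the start of the recording.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events ("
        "label TEXT, start_s REAL, end_s REAL, confidence REAL)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", events)
    conn.commit()
    return conn

def retrieve_events(conn, label=None, t0=0.0, t1=float("inf")):
    """Return events overlapping [t0, t1], optionally filtered by label.

    This is the constrained-evidence step: only these rows, not the raw
    audio, would be handed to the LLM for answer generation.
    """
    query = (
        "SELECT label, start_s, end_s, confidence FROM events "
        "WHERE end_s >= ? AND start_s <= ?"
    )
    params = [t0, t1]
    if label is not None:
        query += " AND label = ?"
        params.append(label)
    return conn.execute(query + " ORDER BY start_s", params).fetchall()

# Hypothetical detections from a multi-hour stream.
events = [
    ("dog_bark", 12.0, 13.5, 0.91),
    ("siren", 3600.5, 3610.0, 0.88),
    ("dog_bark", 7205.0, 7206.2, 0.95),
]
conn = build_event_db(events)

# Counting intent: "How many dog barks in the first two hours?"
# The time reference is resolved to the window [0, 7200] seconds.
evidence = retrieve_events(conn, label="dog_bark", t0=0.0, t1=7200.0)
print(len(evidence))  # → 1 (the bark at 7205 s falls outside the window)
```

The overlap condition (`end_s >= t0 AND start_s <= t1`) keeps events that straddle a window boundary, which matters for counting and summarization queries near the edges of a resolved time range.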