Multimodal conversational agents are highly desirable because they offer natural and human-like interaction. However, comprehensive end-to-end solutions that support collaborative development and benchmarking remain scarce. While proprietary systems like GPT-4o and Gemini demonstrate impressive integration of audio, video, and text with response times of 200-250ms, challenges remain in balancing latency, accuracy, cost, and data privacy. To better understand and quantify these issues, we developed OpenOmni, an open-source, end-to-end pipeline benchmarking tool that integrates advanced technologies such as Speech-to-Text, Emotion Detection, Retrieval-Augmented Generation, and Large Language Models, along with the ability to integrate customized models. OpenOmni supports both local and cloud deployment, ensuring data privacy and enabling latency and accuracy benchmarking. This flexible framework allows researchers to customize the pipeline, focus on real bottlenecks, and rapidly develop proofs of concept. OpenOmni can significantly enhance applications like indoor assistance for visually impaired individuals, advancing human-computer interaction. Our demonstration video is available at https://www.youtube.com/watch?v=zaSiT3clWqY, a live demo at https://openomni.ai4wa.com, and the code at https://github.com/AI4WA/OpenOmniFramework.
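To make the pipeline idea concrete, the sketch below shows one way a modular conversational pipeline with per-stage latency benchmarking could be structured. This is a hypothetical illustration in the spirit of the description above, not OpenOmni's actual API; all class, function, and stage names are assumptions, and the components are dummy stand-ins for real Speech-to-Text, emotion detection, retrieval, and LLM models.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable

# Hypothetical sketch: a pipeline of named, swappable stages, each timed
# individually so latency bottlenecks can be identified per component.
@dataclass
class Pipeline:
    stages: list[tuple[str, Callable[[Any], Any]]]
    timings: dict[str, float] = field(default_factory=dict)

    def run(self, payload: Any) -> Any:
        for name, stage in self.stages:
            start = time.perf_counter()
            payload = stage(payload)  # output of one stage feeds the next
            self.timings[name] = time.perf_counter() - start
        return payload

# Dummy stand-ins for real components; each could be swapped for a local
# model or a cloud-backed service without changing the pipeline itself.
def speech_to_text(audio: bytes) -> str:
    return "turn on the lights"

def detect_emotion(text: str) -> dict:
    return {"text": text, "emotion": "neutral"}

def retrieve_context(state: dict) -> dict:
    state["context"] = ["smart-home manual, section 3"]  # RAG placeholder
    return state

def generate_response(state: dict) -> str:
    return f"Sure, turning on the lights. (context: {state['context'][0]})"

pipeline = Pipeline(stages=[
    ("stt", speech_to_text),
    ("emotion", detect_emotion),
    ("rag", retrieve_context),
    ("llm", generate_response),
])
reply = pipeline.run(b"\x00fake-audio")
print(reply)
# Per-stage latency in milliseconds, the kind of breakdown a benchmarking
# tool would report to locate bottlenecks.
print({k: round(v * 1000, 2) for k, v in pipeline.timings.items()})
```

Keeping each stage behind a plain callable interface is what makes benchmarking and component swapping cheap: a researcher can replace one stand-in with a real model and compare timings without touching the rest of the pipeline.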