Recent advancements in Large Language Models (LLMs) have expanded their capabilities to multimodal contexts, including comprehensive video understanding. However, processing extensive videos such as 24-hour CCTV footage or full-length films presents significant challenges due to the vast data volume and processing demands involved. Traditional methods, like extracting key frames or converting frames to text, often result in substantial information loss. To address these shortcomings, we develop OmAgent, which efficiently stores and retrieves relevant video frames for specific queries, preserving the detailed content of videos. Additionally, it features a Divide-and-Conquer Loop capable of autonomous reasoning, dynamically invoking APIs and tools to enhance query processing and accuracy. This approach ensures robust video understanding, significantly reducing information loss. Experimental results affirm OmAgent's efficacy in handling various types of videos and complex tasks. Moreover, we have endowed it with greater autonomy and a robust tool-calling system, enabling it to accomplish even more intricate tasks.
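To illustrate the general idea behind a divide-and-conquer query loop over long videos, the following is a minimal sketch, not the paper's implementation: it recursively splits a list of video segments, prunes halves judged irrelevant to the query, and answers only over the segments that survive. The `relevance` scorer and `answer` solver are hypothetical placeholders standing in for OmAgent's retrieval and reasoning components.

```python
# Hypothetical sketch of a Divide-and-Conquer Loop for long-video QA.
# `relevance(segment, query)` and `answer(segment, query)` are assumed
# callables, not APIs from the paper.

def divide_and_conquer(segments, query, relevance, answer, threshold=0.5):
    """Recursively split the segment list, pruning halves whose best
    relevance score for the query falls below the threshold."""
    if not segments:
        return []
    if len(segments) == 1:
        seg = segments[0]
        return [answer(seg, query)] if relevance(seg, query) >= threshold else []
    mid = len(segments) // 2
    results = []
    for half in (segments[:mid], segments[mid:]):
        # Descend into a half only if at least one segment in it looks relevant.
        if max(relevance(s, query) for s in half) >= threshold:
            results.extend(
                divide_and_conquer(half, query, relevance, answer, threshold)
            )
    return results


if __name__ == "__main__":
    # Toy stand-ins: a segment is "relevant" if the query word appears in
    # its (pretend) caption, and "answering" just returns the caption.
    relevance = lambda seg, q: 1.0 if q in seg else 0.0
    answer = lambda seg, q: seg
    segments = ["cat on sofa", "empty room", "cat eating"]
    print(divide_and_conquer(segments, "cat", relevance, answer))
```

The pruning step is what keeps the loop tractable on very long videos: irrelevant stretches of footage are discarded wholesale rather than processed frame by frame.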