Finding information in hour-long videos is challenging even for top-performing Vision Language Models (VLMs), as encoding the visual content quickly exceeds available context windows. To tackle this challenge, we present FALCONEye, a novel video agent built on a training-free, model-agnostic meta-architecture that combines a VLM with a Large Language Model (LLM). FALCONEye answers open-ended questions using an exploration-based search algorithm guided by the calibrated confidence of the VLM's answers. We also introduce the FALCON-Bench benchmark, which extends the Question Answering problem to Video Answer Search: for open-ended questions over hour-long videos, models must return both the answer and the temporal window that supports it. With just a 7B VLM and a lightweight LLM, FALCONEye outscores all open-source 7B VLMs and comparable agents on FALCON-Bench. It further demonstrates its generalization capability on the MLVU benchmark, with shorter videos and different tasks, surpassing GPT-4o on single-detail tasks while cutting inference cost by roughly an order of magnitude.
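The abstract does not spell out the search procedure, so the following is a minimal sketch of one plausible reading: a best-first, coarse-to-fine exploration over temporal windows, prioritized by the VLM's calibrated answer confidence. The helper `vlm_answer` is a hypothetical placeholder (the paper's actual VLM interface and the LLM's planning role are not shown here), and all parameter names and thresholds are illustrative assumptions, not the authors' implementation.

```python
import heapq

def vlm_answer(start: float, end: float, question: str) -> tuple[str, float]:
    """Placeholder: run the VLM on frames from [start, end] and return
    (answer, calibrated confidence in [0, 1]). Hypothetical interface."""
    raise NotImplementedError

def answer_search(duration: float, question: str,
                  coarse: float = 120.0, min_len: float = 15.0,
                  accept: float = 0.85):
    # Coarse pass: tile the video with non-overlapping candidate windows
    # and score each with the VLM's calibrated answer confidence.
    heap: list[tuple[float, float, float, str]] = []
    t = 0.0
    while t < duration:
        s, e = t, min(t + coarse, duration)
        answer, conf = vlm_answer(s, e, question)
        heapq.heappush(heap, (-conf, s, e, answer))  # max-heap via negation
        t += coarse
    # Exploration: repeatedly zoom into the most confident window,
    # splitting it in half and rescoring, until the confidence clears
    # the acceptance threshold or the window cannot shrink further.
    while heap:
        neg_conf, s, e, answer = heapq.heappop(heap)
        if -neg_conf >= accept or e - s <= min_len:
            return answer, (s, e)  # answer plus its supporting temporal window
        mid = (s + e) / 2.0
        for ns, ne in ((s, mid), (mid, e)):
            a, c = vlm_answer(ns, ne, question)
            heapq.heappush(heap, (-c, ns, ne, a))
    return None
```

Under these assumptions, the VLM is only ever invoked on short clips, which is what keeps each call inside the model's context window; the calibrated confidence decides both when to stop and where to look next.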