We present HippoCamp, a new benchmark designed to evaluate agents' capabilities in multimodal file management. Unlike existing agent benchmarks that focus on web interaction, tool use, or software automation in generic settings, HippoCamp evaluates agents in user-centric environments, requiring them to model individual user profiles and search massive personal file collections for context-aware reasoning. Our benchmark instantiates device-scale file systems from real-world profiles spanning diverse modalities, comprising 42.4 GB of data across more than 2K real-world files. Building on these raw files, we construct 581 QA pairs that assess agents' capabilities in search, evidence perception, and multi-step reasoning. To facilitate fine-grained analysis, we provide 46.1K densely annotated structured trajectories for step-wise failure diagnosis. We evaluate a wide range of state-of-the-art multimodal large language models (MLLMs) and agentic methods on HippoCamp. Our comprehensive experiments reveal a significant performance gap: even the most advanced commercial models achieve only 48.3% accuracy in user profiling, struggling particularly with long-horizon retrieval and cross-modal reasoning within dense personal file systems. Furthermore, our step-wise failure diagnosis identifies multimodal perception and evidence grounding as the primary bottlenecks. Ultimately, HippoCamp exposes the critical limitations of current agents in realistic, user-centric environments and provides a robust foundation for developing next-generation personal AI assistants.