Visual perception connects high-level semantic understanding to pixel-level perception, but most existing settings assume that the decisive evidence for identifying a target is already present in the image or in frozen model knowledge. We study a more practical but harder open-world case in which the identity of a visible object must first be resolved from external facts, recent events, long-tail entities, or multi-hop relations before the object can be localized. We formalize this challenge as Perception Deep Research and introduce WebEyes, an object-anchored benchmark with verifiable evidence, knowledge-intensive queries, precise box and mask annotations, and three task views: Search-based Grounding, Search-based Segmentation, and Search-based VQA. WebEyes contains 120 images, 473 annotated object instances, 645 unique QA pairs, and 1,927 task samples. We further propose Pixel-Searcher, an agentic search-to-pixel workflow that resolves hidden target identities and binds them to boxes, masks, or grounded answers. Experiments show that Pixel-Searcher achieves the strongest performance among open-source methods across all three task views, and that its failures arise mainly from evidence acquisition, identity resolution, and visual instance binding.