Deep research agents are increasingly evaluated on their ability to search for evidence, reason over retrieved sources, and produce grounded answers. Existing browsing benchmarks, however, largely assume that the user's query and the supporting evidence are written in the same language, leaving open whether agentic search systems can operate when relevant evidence appears in another language. We introduce XBCP (Cross-lingual BrowseComp-Plus), a controlled benchmark that preserves the English question-and-answer space of BrowseComp-Plus but varies the language of the supporting documents. XBCP instantiates two complementary settings. In the cross-lingual setting, each query is paired with evidence in a single assigned language; in the multilingual setting, the full evidence corpus is split evenly and at random across 12 languages spanning high-resource and low-resource regimes. We evaluate four deep research agents using sparse and dense multilingual retrievers, measuring answer accuracy, evidence recall, search behavior, calibration, citation fidelity, and performance under oracle retrieval. Results reveal substantial degradation when evidence is translated. Even strong dense retrievers lose evidence recall, and agents become less calibrated and cite evidence less reliably. Notably, accuracy remains degraded even when all gold evidence is supplied directly. These findings suggest that cross-lingual deep research exposes both retrieval failures and an independent, agent-side difficulty in integrating language-mismatched evidence.
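The multilingual setting and the evidence-recall metric can be illustrated with a minimal sketch. It is not the authors' construction code: the language codes, the function names, the seed, and the set-based definition of recall (the fraction of gold documents that appear among the retrieved ones) are all assumptions made for illustration, since the abstract names neither the 12 languages nor the exact metric definition.

```python
import random
from collections import Counter

# Hypothetical placeholder codes: the source states 12 languages but does not list them.
LANGUAGES = [f"lang_{i:02d}" for i in range(12)]


def assign_languages(doc_ids, languages, seed=0):
    """Map each document id to a language, evenly and at random.

    Shuffling and then cycling through the languages keeps the
    per-language counts within one of each other.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(doc_ids)
    rng.shuffle(shuffled)
    return {
        doc_id: languages[i % len(languages)]
        for i, doc_id in enumerate(shuffled)
    }


def evidence_recall(retrieved_ids, gold_ids):
    """Assumed definition: share of gold evidence documents that were retrieved."""
    gold = set(gold_ids)
    if not gold:
        return 0.0
    return len(gold & set(retrieved_ids)) / len(gold)


if __name__ == "__main__":
    assignment = assign_languages(range(120), LANGUAGES, seed=0)
    print(Counter(assignment.values()))
    print(evidence_recall(["d1", "d2", "d9"], ["d1", "d2", "d3"]))
```

Cycling over a shuffled order, rather than sampling a language independently for each document, is what makes the split exactly balanced instead of balanced only in expectation.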