MAPS: A Multilingual Benchmark for Agent Performance and Security

Omer Hofman, Jonathan Brokman, Oren Rachmil, Shamik Bose, Vikas Pahuja, Toshiya Shimizu, Trisha Starostina, Kelly Marchisio, Seraphina Goldfarb-Tarrant, Roman Vainshtein

From arXiv. Accepted to Findings of EACL 2026.

Agentic AI systems, which build on Large Language Models (LLMs) and interact with tools and memory, have rapidly advanced in capability and scope. Yet LLMs have been shown to struggle in multilingual settings, typically with lower performance and reduced safety, so agentic systems risk inheriting these limitations. This raises concerns about the accessibility of such systems, as users interacting in languages other than English may encounter unreliable or security-critical agent behavior. Despite growing interest in evaluating agentic AI and recent initial efforts toward multilingual interaction, existing benchmarks do not yet provide a comprehensive, multi-domain, security-aware evaluation of multilingual agentic systems. To address this gap, we propose MAPS, a multilingual benchmark suite designed to evaluate agentic AI systems across diverse languages and tasks. MAPS builds on four widely used agentic benchmarks: GAIA (real-world tasks), SWE-Bench (code generation), MATH (mathematical reasoning), and the Agent Security Benchmark (security). We translate each dataset into eleven diverse languages, resulting in 805 unique tasks and 9,660 total language-specific instances, which enables a systematic analysis of the Multilingual Effect on AI agents' performance and robustness. Empirically, we observe a degradation in both performance and security when transitioning from English to other languages, with severity varying by task and correlating with the amount of translated input. This work establishes the first standardized evaluation framework for multilingual agentic AI, encouraging future research towards equitable, reliable, and accessible agentic AI. The MAPS benchmark suite is publicly available at https://huggingface.co/datasets/Fujitsu-FRE/MAPS
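
The instance count reported in the abstract can be sanity-checked with simple arithmetic. The sketch below assumes that each of the 805 tasks appears exactly once per language version. Under that assumption, 9,660 corresponds to 12 versions per task, which suggests (though the abstract does not say so) that the English originals are counted alongside the eleven translations. The function name `instances` and the flag `include_source` are illustrative and are not part of the MAPS release.

```python
# Figures below are taken from the abstract; the helper itself is a
# hypothetical consistency check, not code from the MAPS benchmark.
UNIQUE_TASKS = 805          # unique tasks across the four source benchmarks
TARGET_LANGUAGES = 11       # languages each dataset is translated into
REPORTED_INSTANCES = 9_660  # total language-specific instances reported


def instances(unique_tasks: int, target_languages: int, include_source: bool) -> int:
    """Count instances, assuming every task exists once per language version."""
    versions = target_languages + (1 if include_source else 0)
    return unique_tasks * versions


# Assumption being tested: are the English originals part of the total?
translated_only = instances(UNIQUE_TASKS, TARGET_LANGUAGES, include_source=False)
with_english = instances(UNIQUE_TASKS, TARGET_LANGUAGES, include_source=True)

print(translated_only)                     # 8855
print(with_english)                        # 9660
print(with_english == REPORTED_INSTANCES)  # True
```

Only the count that includes the source language matches the reported total, so readers comparing per-language results should expect twelve versions of each task, on the assumption that the split is uniform across tasks.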
