Sola-Visibility-ISPM: Benchmarking Agentic AI for Identity Security Posture Management Visibility

arXiv preprint, 20 pages, 3 figures. Benchmark and evaluation framework for agentic AI in identity security posture management, including expert evaluation and LLM-as-judge analysis.

Identity Security Posture Management (ISPM) is a core challenge for modern enterprises operating across cloud and SaaS environments. Answering even basic ISPM visibility questions, such as those about identity inventory and configuration hygiene, requires interpreting complex identity data, which has motivated growing interest in agentic AI systems. Despite this interest, there is currently no standardized way to evaluate how well such systems perform ISPM visibility tasks on real enterprise data. We introduce the Sola Visibility ISPM Benchmark, the first benchmark designed to evaluate agentic AI systems on foundational ISPM visibility tasks using a live, production-grade identity environment spanning AWS, Okta, and Google Workspace. The benchmark focuses on identity inventory and hygiene questions and is accompanied by the Sola AI Agent, a tool-using agent that translates natural-language queries into executable data exploration steps and produces verifiable, evidence-backed answers. Across 77 benchmark questions, the agent achieves an expert accuracy of 0.84 and a strict success rate of 0.77. Performance is highest on AWS hygiene tasks, where expert accuracy reaches 0.94, while results on Google Workspace and Okta hygiene tasks are lower but remain competitive. Overall, this work provides a practical and reproducible benchmark for evaluating agentic AI systems in identity security and establishes a foundation for future ISPM benchmarks covering more advanced identity analysis and governance tasks.
