SecLens: Role-Specific Evaluation of LLMs for Security Vulnerability Detection

Existing benchmarks for LLM-based vulnerability detection compress model performance into a single metric, which fails to reflect the distinct priorities of different stakeholders. For example, a CISO may emphasize high recall of critical vulnerabilities, an engineering leader may prioritize minimizing false positives, and an AI officer may balance capability against cost. To address this limitation, we introduce SecLens-R, a multi-stakeholder evaluation framework structured around 35 shared dimensions grouped into 7 measurement categories. The framework defines five role-specific weighting profiles: CISO, Chief AI Officer, Security Researcher, Head of Engineering, and AI-as-Actor. Each profile selects 12 to 16 dimensions with weights summing to 80, yielding a composite Decision Score between 0 and 100. We apply SecLens-R to evaluate 12 frontier models on a dataset of 406 tasks derived from 93 open-source projects, covering 10 programming languages and 8 OWASP-aligned vulnerability categories. Evaluations are conducted across two settings: Code-in-Prompt (CIP) and Tool-Use (TU). Results show substantial variation across stakeholder perspectives, with Decision Scores differing by as much as 31 points for the same model. For instance, Qwen3-Coder achieves an A (76.3) under the Head of Engineering profile but a D (45.2) under the CISO profile, while GPT-5.4 shows a similar disparity. These findings demonstrate that vulnerability detection is inherently a multi-objective problem and that stakeholder-aware evaluation provides insights that single aggregated metrics obscure.
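To make the scoring concrete, the following is a minimal sketch of how a role-specific Decision Score could be computed. The abstract does not say how weights summing to 80 map onto a 0 to 100 scale, so the sketch assumes a weighted mean normalized by the total weight. The dimension names, per-dimension scores, and weights are all hypothetical, and only three dimensions are used per profile instead of the 12 to 16 described above.

```python
from typing import Dict


def decision_score(dim_scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of the selected dimension scores, rescaled to 0-100.

    ASSUMPTION: dimension scores lie in [0, 1] and the composite is
    normalized by the total weight; the source does not specify this.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(w * dim_scores[name] for name, w in weights.items())
    return 100.0 * weighted / total_weight


# Hypothetical per-dimension results for one model (illustrative values only).
model_scores = {
    "critical_recall": 0.40,
    "false_positive_control": 0.90,
    "cost_efficiency": 0.70,
}

# Hypothetical role profiles; each set of weights sums to 80, as in the abstract.
ciso_profile = {"critical_recall": 60, "false_positive_control": 10, "cost_efficiency": 10}
eng_profile = {"critical_recall": 10, "false_positive_control": 60, "cost_efficiency": 10}

ciso_score = decision_score(model_scores, ciso_profile)
eng_score = decision_score(model_scores, eng_profile)

if __name__ == "__main__":
    print(f"CISO profile:                {ciso_score:.1f}")
    print(f"Head of Engineering profile: {eng_score:.1f}")
```

Under these made-up numbers, the same model scores considerably higher under the profile that rewards false-positive control than under the one that rewards recall of critical vulnerabilities, which is the kind of divergence the framework is designed to surface.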
