Third-Party Risk Assessment (TPRA) is a core cybersecurity practice for evaluating suppliers against standards and frameworks such as ISO/IEC 27001 and those published by NIST. TPRA questionnaires are typically drawn from large repositories of security and compliance questions, yet tailoring an assessment to an organization's needs remains a largely manual process. Existing retrieval approaches rely on keyword or surface-level similarity, which often fails to capture implicit assessment scope and control semantics. This paper explores strategies for organizing and retrieving TPRA cybersecurity questions using semantic labels that describe both control domains and assessment scope. We compare two labeling strategies: direct question-level labeling with a Large Language Model (LLM), and a hybrid semi-supervised semantic labeling (SSSL) pipeline that clusters questions in embedding space, labels a small representative subset with an LLM, and propagates those labels to the remaining questions using k-Nearest Neighbors. We also compare two downstream retrieval strategies: retrieval by direct question similarity and retrieval in the label space. We find that semantic labels can improve retrieval alignment when the labels are discriminative and consistent, and that SSSL can generalize labels from a small labeled subset to large repositories while substantially reducing LLM usage and cost.
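The three SSSL stages named in the abstract (cluster, label representatives, propagate) can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. Everything specific here is an assumption made for illustration: the function names `sssl_labels` and `kmeans_centroids`, the `llm_label` callable standing in for an LLM call, k-means as the clustering method, squared Euclidean distance, the choice of representatives as the questions nearest each centroid, and the default parameter values. The embeddings are assumed to be precomputed by some embedding model.

```python
from collections import Counter
from typing import Callable, List, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_centroids(points: Sequence[Vector], k: int, iters: int = 20) -> List[List[float]]:
    """Plain k-means with deterministic farthest-point initialisation (an assumption)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Next seed: the point farthest from every centroid chosen so far.
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    for _ in range(iters):
        groups: List[List[Vector]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            groups[nearest].append(p)
        for c, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def sssl_labels(
    questions: Sequence[str],
    embeddings: Sequence[Vector],
    llm_label: Callable[[str], str],  # hypothetical stand-in for an LLM call
    n_clusters: int,
    reps_per_cluster: int = 1,
    k: int = 1,
) -> List[str]:
    """Label every question while calling llm_label only on representatives."""
    n = len(questions)
    centroids = kmeans_centroids(embeddings, n_clusters)

    # 1) Representatives: the questions nearest to each centroid.
    reps = set()
    for c in centroids:
        ranked = sorted(range(n), key=lambda i: sq_dist(embeddings[i], c))
        reps.update(ranked[:reps_per_cluster])

    # 2) The LLM labels only the representative subset.
    labeled = {i: llm_label(questions[i]) for i in sorted(reps)}

    # 3) kNN propagation: each remaining question takes the majority label
    #    of its k nearest labeled representatives.
    result = []
    for i in range(n):
        if i in labeled:
            result.append(labeled[i])
            continue
        nearest = sorted(labeled, key=lambda j: sq_dist(embeddings[i], embeddings[j]))[:k]
        votes = Counter(labeled[j] for j in nearest)
        # Ties go to the label seen first, i.e. the nearer neighbour.
        result.append(max(votes, key=votes.get))
    return result
```

In this sketch the number of LLM calls is bounded by `n_clusters * reps_per_cluster` rather than by the size of the repository, which is the cost reduction the abstract refers to. Label quality then depends on how well the clusters separate control domains and scopes, so `n_clusters`, `reps_per_cluster`, and `k` would need tuning on real data.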