Large language models (LLMs) are increasingly applied in specialized domains such as finance and healthcare, where they introduce unique safety risks. Domain-specific datasets of harmful prompts remain scarce and largely rely on manual construction; existing public datasets focus mainly on explicit harmful prompts, which modern LLM defenses can often detect and refuse. In contrast, implicit harmful prompts, which express harmful intent through indirect domain knowledge, are harder to detect and better reflect real-world threats. We identify two key challenges: transforming domain knowledge into actionable constraints and increasing the implicitness of generated harmful prompts. To address them, we propose an end-to-end framework that first performs knowledge-graph-guided harmful prompt generation to systematically produce domain-relevant prompts, and then applies dual-path obfuscation rewriting, converting explicit harmful prompts into implicit variants via direct and context-enhanced rewriting. The framework yields high-quality datasets that combine strong domain relevance with implicitness, enabling more realistic red-teaming and advancing LLM safety research. We release our code and datasets on GitHub.
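The following is a minimal sketch of the two-stage pipeline described above, not the paper's actual implementation: all names (the knowledge-graph triple format, the rewrite instructions, and the `call_llm` stub) are illustrative assumptions.

```python
# Hypothetical sketch: KG-guided generation followed by dual-path obfuscation rewriting.
# Every identifier and prompt template here is an assumption for illustration only.
from dataclasses import dataclass
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def call_llm(prompt: str) -> str:
    """Stub for an LLM call; a real client would be substituted in practice."""
    return f"<llm output for: {prompt}>"


@dataclass
class HarmfulPromptSample:
    explicit: str    # stage 1: KG-guided explicit harmful prompt
    direct: str      # stage 2a: direct obfuscation rewrite
    contextual: str  # stage 2b: context-enhanced obfuscation rewrite


def kg_guided_generation(triples: List[Triple], goal: str,
                         llm: Callable[[str], str] = call_llm) -> str:
    """Stage 1: turn domain knowledge (KG triples) into constraints that
    ground an explicit, domain-relevant harmful prompt."""
    constraints = "; ".join(f"{h} {r} {t}" for h, r, t in triples)
    return llm(
        f"Write a prompt pursuing the goal '{goal}' and grounded in these "
        f"domain facts: {constraints}"
    )


def dual_path_obfuscation(explicit_prompt: str,
                          llm: Callable[[str], str] = call_llm) -> HarmfulPromptSample:
    """Stage 2: produce two implicit variants of an explicit harmful prompt."""
    direct = llm(
        "Rewrite the following request so its intent is only implied through "
        f"indirect domain terminology: {explicit_prompt}"
    )
    contextual = llm(
        "Embed the following request inside a plausible professional scenario, "
        f"keeping its intent implicit: {explicit_prompt}"
    )
    return HarmfulPromptSample(explicit_prompt, direct, contextual)


if __name__ == "__main__":
    kg = [("warfarin", "interacts_with", "aspirin"),
          ("aspirin", "increases_risk_of", "gastrointestinal bleeding")]
    explicit = kg_guided_generation(kg, goal="<harmful objective in the target domain>")
    sample = dual_path_obfuscation(explicit)
    print(sample)
```

In this sketch the LLM caller is injected as a parameter so the two rewriting paths can be driven by the same or different models; the actual framework's prompting and model choices are described in the paper body.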