Large language models (LLMs) are increasingly deployed in cost-sensitive and on-device scenarios, yet safety guardrails have advanced mainly in English. Real-world Chinese malicious queries, however, typically conceal intent through homophones, pinyin, symbol-based splitting, and other Chinese-specific patterns. These adversarial patterns create a safety evaluation gap that existing English-focused benchmarks do not capture well. The gap is particularly concerning for lightweight models, which may be more vulnerable to such perturbations. To bridge it, we introduce the Chinese-Specific Safety Benchmark (CSSBench), which emphasizes these adversarial patterns and evaluates the safety of lightweight LLMs in Chinese. The benchmark covers six domains common in real Chinese scenarios: illegal activities and compliance, privacy leakage, health and medical misinformation, fraud and hate, adult content, and public and political safety; queries are further organized into multiple task types. We evaluate a set of popular lightweight LLMs and measure over-refusal behavior to assess safety-induced performance degradation. Our results show that Chinese-specific adversarial patterns pose a critical challenge for lightweight LLMs. CSSBench offers a comprehensive evaluation of LLM safety in Chinese, supporting robust deployment in practice.