Taiwan Safety Benchmark and Breeze Guard: Toward Trustworthy AI for Taiwanese Mandarin

Global safety models perform strongly on widely used benchmarks, yet their training data rarely captures the cultural and linguistic nuances of Taiwanese Mandarin. This limitation produces systematic blind spots when interpreting region-specific risks such as localized financial scams, culturally embedded hate speech, and misinformation patterns. To address these gaps, we introduce TS-Bench (Taiwan Safety Benchmark), a standardized evaluation suite for assessing safety performance in Taiwanese Mandarin. TS-Bench contains 400 human-curated prompts spanning critical domains including financial fraud, medical misinformation, social discrimination, and political manipulation. In parallel, we present Breeze Guard, an 8B safety model derived from Breeze 2, our previously released general-purpose Taiwanese Mandarin LLM, whose pre-training corpus gives it strong cultural grounding. Breeze Guard is obtained through supervised fine-tuning on a large-scale, human-verified synthetic dataset targeting Taiwan-specific harms. Our central hypothesis is that effective safety detection requires cultural grounding already present in the base model; safety fine-tuning alone is insufficient to introduce new sociolinguistic knowledge from scratch. Empirically, Breeze Guard significantly outperforms the leading 8B general-purpose safety model, Granite Guardian 3.3, on TS-Bench (+0.17 overall F1), with particularly large gains in high-context categories such as scam (+0.66 F1) and financial malpractice (+0.43 F1). Although the model performs slightly worse on English-centric benchmarks (ToxicChat, AegisSafetyTest), this tradeoff is expected for a regionally specialized safety model optimized for Taiwanese Mandarin. Together, Breeze Guard and TS-Bench establish a new foundation for trustworthy AI deployment in Taiwan.
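The F1 comparisons above are reported both overall and per harm category. As a minimal sketch of how such scores could be computed, the snippet below groups binary safe/harmful predictions by category and computes F1 for each group and for the pooled set. The record layout, the function names, and the pooled definition of "overall" F1 are assumptions made for illustration; the abstract does not specify the aggregation used for TS-Bench, and the toy records are not benchmark data.

```python
from collections import defaultdict


def f1_score(labels, preds):
    """Binary F1 where True means 'harmful'. Returns 0.0 if there are no true positives."""
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    fp = sum(1 for y, p in zip(labels, preds) if not y and p)
    fn = sum(1 for y, p in zip(labels, preds) if y and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def per_category_f1(records):
    """records: list of (category, is_harmful, predicted_harmful) tuples (hypothetical layout)."""
    grouped = defaultdict(lambda: ([], []))
    for category, label, pred in records:
        grouped[category][0].append(label)
        grouped[category][1].append(pred)
    scores = {cat: f1_score(ys, ps) for cat, (ys, ps) in grouped.items()}
    # Assumption: "overall" is F1 over all pooled examples, not a macro average.
    scores["overall"] = f1_score([r[1] for r in records], [r[2] for r in records])
    return scores


# Toy records for illustration only.
toy_records = [
    ("scam", True, True),
    ("scam", True, False),
    ("scam", False, False),
    ("hate", True, True),
    ("hate", False, True),
]
toy_scores = per_category_f1(toy_records)
```

Running the same function on the predictions of two models and subtracting the resulting dictionaries entry by entry would give per-category differences of the kind quoted above.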
