As Large Language Models (LLMs) are increasingly deployed in Polish-language applications, the need for efficient and accurate content safety classifiers has become paramount. We present Bielik Guard, a family of compact Polish-language safety classifiers comprising two variants: a 0.1B-parameter model based on MMLW-RoBERTa-base and a 0.5B-parameter model based on PKOBP/polish-roberta-8k. Fine-tuned on a community-annotated dataset of 6,885 Polish texts, the models classify content across five safety categories: Hate/Aggression, Vulgarities, Sexual Content, Crime, and Self-Harm. Our evaluation shows that both models achieve strong performance on multiple benchmarks. The 0.5B variant offers the best overall discrimination, with F1 scores of 0.791 (micro) and 0.785 (macro) on the test set, while the 0.1B variant is exceptionally efficient. Notably, Bielik Guard 0.1B v1.1 achieves superior precision (77.65%) and a very low false positive rate (0.63%) on real user prompts, outperforming HerBERT-PL-Guard (31.55% precision, 4.70% FPR) at the same model size. The models are publicly available and are designed to enable appropriate responses rather than simple content blocking, particularly for sensitive categories such as self-harm.
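The five safety categories suggest a multi-label classification setup, where each text can receive zero or more flags. A minimal sketch of how per-category scores might be turned into flagged labels (the category names come from the abstract; the sigmoid-plus-threshold post-processing and the logit interface are assumptions, not a description of the released models' actual heads):

```python
import math

# Category names as listed in the abstract
CATEGORIES = ["Hate/Aggression", "Vulgarities", "Sexual Content", "Crime", "Self-Harm"]

def sigmoid(x: float) -> float:
    """Standard logistic function, mapping a raw logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def flag_categories(logits: list[float], threshold: float = 0.5) -> list[str]:
    """Return the subset of safety categories whose score clears the threshold.

    `logits` is assumed to hold one raw score per category, in the order
    of CATEGORIES; an empty result means the text is considered safe.
    """
    return [name for name, z in zip(CATEGORIES, logits) if sigmoid(z) >= threshold]

# Example: only the Vulgarities slot carries a high logit
print(flag_categories([-4.0, 2.5, -3.0, -5.0, -6.0]))  # → ['Vulgarities']
```

Keeping the flags as a list rather than a single safe/unsafe bit fits the stated design goal: a downstream application can respond differently per category (e.g., surfacing support resources for Self-Harm) instead of blocking uniformly.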