As Large Language Models (LLMs) are increasingly deployed in Polish-language applications, the need for efficient and accurate content safety classifiers has become paramount. We present Bielik Guard, a family of compact Polish-language safety classifiers comprising two variants: a 0.1B-parameter model based on MMLW-RoBERTa-base and a 0.5B-parameter model based on PKOBP/polish-roberta-8k. Fine-tuned on a community-annotated dataset of 6,885 Polish texts, the models classify content across five safety categories: Hate/Aggression, Vulgarities, Sexual Content, Crime, and Self-Harm. Our evaluation shows that both models achieve strong performance on multiple benchmarks. The 0.5B variant offers the best overall discrimination, with F1 scores of 0.791 (micro) and 0.785 (macro) on the test set, while the 0.1B variant is exceptionally efficient. Notably, Bielik Guard 0.1B v1.1 achieves superior precision (77.65%) and a very low false positive rate (0.63%) on real user prompts, outperforming HerBERT-PL-Guard (31.55% precision, 4.70% FPR) at the same model size. The models are publicly available and are designed to enable appropriate responses rather than simple content blocking, particularly for sensitive categories such as self-harm.
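The five safety categories suggest a multi-label classification setup, where each text can receive zero or more flags. A minimal sketch of how per-category scores might be turned into flagged labels (the category names come from the abstract; the sigmoid-plus-threshold post-processing and the logit interface are assumptions, not a description of the released models' actual heads):

```python
import math

# Category names as listed in the abstract
CATEGORIES = ["Hate/Aggression", "Vulgarities", "Sexual Content", "Crime", "Self-Harm"]

def sigmoid(x: float) -> float:
    """Standard logistic function, mapping a raw logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def flag_categories(logits: list[float], threshold: float = 0.5) -> list[str]:
    """Return the subset of safety categories whose score clears the threshold.

    `logits` is assumed to hold one raw score per category, in the order
    of CATEGORIES; an empty result means the text is considered safe.
    """
    return [name for name, z in zip(CATEGORIES, logits) if sigmoid(z) >= threshold]

# Example: only the Vulgarities slot carries a high logit
print(flag_categories([-4.0, 2.5, -3.0, -5.0, -6.0]))  # → ['Vulgarities']
```

Keeping the flags as a list rather than a single safe/unsafe bit fits the stated design goal: a downstream application can respond differently per category (e.g., surfacing support resources for Self-Harm) instead of blocking uniformly.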