Social platforms have revolutionized information sharing, but they have also accelerated the dissemination of harmful and policy-violating content. To ensure safety and compliance at scale, moderation systems must go beyond efficiency to offer accuracy and interpretability. However, current approaches largely rely on noisy, label-driven learning, lacking alignment with moderation rules and producing opaque decisions that hinder human review. We therefore propose Hierarchical Guard (Hi-Guard), a multimodal moderation framework that introduces a new policy-aligned decision paradigm. The term "Hierarchical" reflects two key aspects of the system design: (1) a hierarchical moderation pipeline, in which a lightweight binary model first filters safe content and a stronger model handles fine-grained risk classification; and (2) a hierarchical taxonomy in the second stage, over which the model performs path-based classification from coarse to fine-grained levels. To stay aligned with evolving moderation policies, Hi-Guard incorporates rule definitions directly into the model prompt. To further strengthen structured prediction and reasoning, we introduce a multi-level soft-margin reward and optimize with Group Relative Policy Optimization (GRPO), penalizing semantically adjacent misclassifications and improving explanation quality. Extensive experiments and real-world deployment show that Hi-Guard achieves superior classification accuracy, generalization, and interpretability, paving the way toward scalable, transparent, and trustworthy content safety systems. Code is available at: https://github.com/lianqi1008/Hi-Guard.
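To make the idea of a multi-level soft-margin reward concrete, the sketch below scores a predicted taxonomy path against the gold path, giving partial credit for each level matched before the paths diverge. This is a minimal illustration, not the paper's exact reward: the function name, equal per-level weights, and the example category labels are all assumptions for exposition.

```python
# Hypothetical sketch: a level-weighted reward for path-based
# classification over a coarse-to-fine taxonomy. A misclassification
# that shares coarse ancestors (a "semantically adjacent" error)
# retains partial credit, while a coarse-level miss scores zero.

def path_reward(pred_path, gold_path, level_weights=None):
    """Return a reward in [0, 1] for a predicted taxonomy path."""
    depth = max(len(pred_path), len(gold_path))
    if level_weights is None:
        # Equal weight per level; weights could instead decay with depth.
        level_weights = [1.0 / depth] * depth
    reward = 0.0
    for i in range(depth):
        if (i < len(pred_path) and i < len(gold_path)
                and pred_path[i] == gold_path[i]):
            reward += level_weights[i]
        else:
            break  # paths diverge; no credit below the divergence point
    return reward

# A leaf-level miss sharing the coarse categories keeps partial credit:
print(path_reward(["violence", "graphic", "gore"],
                  ["violence", "graphic", "injury"]))  # 2/3 ≈ 0.667
# A coarse-level miss earns nothing:
print(path_reward(["spam"], ["violence", "graphic", "injury"]))  # 0.0
```

In a GRPO setup, a shaped reward like this (combined with format and explanation-quality terms) would be computed per sampled completion within a group; the graded penalty is what distinguishes adjacent from distant misclassifications.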