Detecting harmful content in multi-turn dialogue requires reasoning over the full conversational context rather than over isolated utterances. However, most existing methods rely mainly on a model's internal parametric knowledge, without explicit grounding in external normative principles. This often leads to inconsistent judgments in socially nuanced contexts, limited interpretability, and redundant reasoning across turns. To address this, we propose RoTRAG, a retrieval-augmented framework that incorporates concise, human-written moral norms, called Rules of Thumb (RoTs), into LLM-based harm assessment. For each turn, RoTRAG retrieves relevant RoTs from an external corpus and uses them as explicit normative evidence for turn-level reasoning and final severity classification. To improve efficiency, we further introduce a lightweight binary routing classifier that decides whether a new turn requires retrieval-grounded reasoning or can reuse the existing context. Experiments on ProsocialDialog and Safety Reasoning Multi Turn Dialogue show that RoTRAG consistently improves both harm classification and severity estimation over competitive baselines, with an average relative gain of around 40% in F1 across the benchmark datasets and an average relative reduction of 8.4% in distributional error, while reducing redundant computation without sacrificing performance.
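The per-turn control flow described above (route, then either retrieve RoTs or reuse the cached ones) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the three-entry `ROT_CORPUS`, the token-overlap retriever standing in for the actual retrieval model, and the length-based `needs_retrieval` heuristic standing in for the learned binary routing classifier. The LLM reasoning and severity classification steps are omitted.

```python
# Illustrative sketch only; all names and data below are hypothetical.

# Toy RoT corpus (assumption: the real corpus and its wording are not shown in the abstract).
ROT_CORPUS = [
    "It is wrong to threaten other people.",
    "You should not encourage someone to hurt themselves.",
    "It is good to be polite to strangers.",
]

# Small stopword list so that function words do not count as evidence of relevance.
STOPWORDS = {"to", "it", "is", "i", "you", "be", "my", "not", "should"}


def tokenize(text):
    """Lowercase, strip basic punctuation, and drop stopwords."""
    return {w.strip(".,!?").lower() for w in text.split()} - STOPWORDS


def retrieve_rots(turn, corpus, k=2):
    """Rank RoTs by token overlap with the turn (stand-in for a real retriever)."""
    query = tokenize(turn)
    scored = [(len(query & tokenize(rot)), rot) for rot in corpus]
    scored = [(score, rot) for score, rot in scored if score > 0]
    scored.sort(key=lambda pair: -pair[0])  # stable sort keeps corpus order on ties
    return [rot for _, rot in scored[:k]]


def needs_retrieval(turn, cached_rots, min_tokens=4):
    """Stand-in router: always retrieve when nothing is cached; otherwise
    retrieve only for turns with at least `min_tokens` whitespace tokens."""
    return not cached_rots or len(turn.split()) >= min_tokens


def assess_dialogue(turns, corpus):
    """Return (turn, retrieved_flag, rots_in_use) for every turn in order."""
    cached = []
    trace = []
    for turn in turns:
        retrieved = needs_retrieval(turn, cached)
        if retrieved:
            cached = retrieve_rots(turn, corpus)
        # A real system would now pass the turn, the context, and `cached`
        # to an LLM for turn-level reasoning and severity classification.
        trace.append((turn, retrieved, list(cached)))
    return trace


if __name__ == "__main__":
    dialogue = [
        "I want to threaten my neighbor.",
        "Yes really.",
        "Should I be polite to strangers instead?",
    ]
    for turn, retrieved, rots in assess_dialogue(dialogue, ROT_CORPUS):
        print(f"retrieved={retrieved!s:5} turn={turn!r} rots={rots}")
```

In this toy run the short follow-up turn reuses the RoTs retrieved for the first turn, which is the kind of redundant retrieval the routing step is meant to avoid; a learned classifier would replace the length heuristic.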