Recent research suggests that LLMs can be manipulated through both adversarial and seemingly benign inputs, resulting in harmful, biased, or policy-violating outputs. In this paper, we study an underexplored issue: harmful and toxic mathematical word problems. We show that math questions, particularly those framed as natural language narratives, can serve as a subtle medium for propagating biased, unethical, or psychologically harmful content, with heightened risks in educational settings involving children. To support a systematic study of this phenomenon, we introduce ToxicGSM, a dataset of 1.9k arithmetic problems that embed harmful or sensitive context while preserving mathematically well-defined reasoning tasks. Using this dataset, we audit the behaviour of existing LLMs and analyse the trade-offs between safety enforcement and mathematical correctness. We further propose SafeMath, a safety alignment technique that reduces harmful outputs while maintaining, and in some cases improving, mathematical reasoning performance. Our results highlight the importance of disentangling linguistic harm from mathematical reasoning and demonstrate that effective safety alignment need not come at the cost of accuracy. We release the source code and dataset at https://github.com/Swagnick99/SafeMath/tree/main.