The deployment of Large Language Models (LLMs) in diverse applications necessitates an assurance of safety without compromising the contextual integrity of the generated content. Traditional approaches, including safety-specific fine-tuning and adversarial testing, often yield safe outputs at the expense of contextual meaning. This can diminish a model's capacity to handle nuanced aspects of bias and toxicity, such as underrepresentation or negative portrayals across various demographics. To address these challenges, we introduce MBIAS, an LLM framework instruction-fine-tuned on a custom dataset designed specifically for safety interventions. MBIAS is designed to significantly reduce bias and toxic elements in LLM outputs while preserving the main information. This work also details our further use of LLMs as annotators under human supervision and as evaluators of generated content. Empirical analysis reveals that MBIAS reduces bias and toxicity by over 30\% in standard evaluations and by more than 90\% in diverse demographic tests, highlighting the robustness of our approach. We release the dataset and the fine-tuned model to the research community to support further investigation and ensure reproducibility. The code for this project is available at https://github.com/shainarazavi/MBIAS/tree/main. Warning: This paper contains examples that may be offensive or upsetting.