Multimodal large language models (MLLMs) are rapidly evolving, presenting increasingly complex safety challenges. However, current risk-oriented dataset construction methods fail to cover the growing complexity of real-world multimodal safety (RMS) scenarios, and due to the lack of a unified evaluation metric, their overall effectiveness remains unproven. This paper introduces a novel image-oriented, self-adaptive dataset construction method for RMS, which starts from images and ends by constructing paired texts and guidance responses. Using this image-oriented method, we automatically generate an RMS dataset comprising 35k image-text pairs with guidance responses. Additionally, we introduce a standardized evaluation metric for safety datasets: fine-tuning a safety judge model on the dataset and evaluating its capabilities on other safety datasets. Extensive experiments on various tasks demonstrate the effectiveness of the proposed image-oriented pipeline. The results confirm the scalability and effectiveness of the image-oriented approach, offering a new perspective on the construction of real-world multimodal safety datasets. The dataset is available at https://huggingface.co/datasets/NewCityLetter/RMS2/tree/main.