As Large Models (LMs) develop rapidly, their safety has become a priority. In current safety workflows for Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), evaluation, diagnosis, and alignment are often handled by separate tools. Specifically, safety evaluation can only locate external behavioral risks but cannot identify their internal root causes. Meanwhile, safety diagnosis often drifts away from concrete risk scenarios and remains at the level of general interpretability. As a result, safety alignment lacks dedicated explanations of the changes in internal mechanisms, potentially degrading general capabilities. To systematically address these issues, we propose an open-source project, DeepSight, that practices a new integrated safety evaluation-diagnosis paradigm. DeepSight is a low-cost, reproducible, efficient, and highly scalable large-model safety evaluation project consisting of an evaluation toolkit, DeepSafe, and a diagnosis toolkit, DeepScan. By unifying task and data protocols, we build a connection between the two stages and transform safety evaluation from black-box observation into white-box insight. In addition, DeepSight is the first open-source toolkit that supports both frontier AI risk evaluation and joint safety evaluation and diagnosis.