Machine Learning (ML) models, including Large Language Models (LLMs), are characterized by a range of system-level attributes such as security and reliability. Recent studies have demonstrated that ML models are vulnerable to multiple forms of security violations, among which backdoor data-poisoning attacks represent a particularly insidious threat, enabling unauthorized model behavior and systematic misclassification. In parallel, deficiencies in model reliability can manifest as hallucinations in LLMs, leading to unpredictable outputs and substantial risks for end users. In this work on Dependable Artificial Intelligence with Reliability and Security (DAIReS), we propose a novel unified approach based on syndrome decoding for the detection of both security and reliability violations in learning-based systems. Specifically, we adapt the syndrome decoding approach to the NLP sentence-embedding space, enabling the discrimination of poisoned and non-poisoned samples within ML training datasets. Additionally, the same methodology can effectively detect hallucinated content arising from self-referential meta-explanation tasks in LLMs.