Because large language models (LLMs) are black boxes and their generated content is highly realistic, issues such as hallucination, bias, unfairness, and copyright infringement have become pressing concerns. In this context, sourcing information from multiple perspectives is essential. This survey presents a systematic investigation organized around four interrelated dimensions: Model Sourcing, Model Structure Sourcing, Training Data Sourcing, and External Data Sourcing. It also proposes a unified dual-paradigm taxonomy that classifies existing sourcing methods into prior-based approaches (proactive traceability embedding) and posterior-based approaches (retrospective inference). Traceability across these dimensions enhances the transparency, accountability, and trustworthiness of LLM deployment in real-world applications.