Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models

Ensuring the trustworthiness of large language models (LLMs) is crucial. Most studies concentrate on fully pre-trained LLMs to better understand and improve their trustworthiness. In this paper, to reveal the untapped potential of pre-training, we pioneer the exploration of LLMs' trustworthiness during this period, focusing on five key dimensions: reliability, privacy, toxicity, fairness, and robustness. To begin with, we apply linear probing to LLMs. The high probing accuracy suggests that \textit{LLMs in early pre-training can already distinguish concepts in each trustworthiness dimension}. Therefore, to further uncover the hidden possibilities of pre-training, we extract steering vectors from an LLM's pre-training checkpoints to enhance the LLM's trustworthiness. Finally, inspired by the finding of~\citet{choi2023understanding} that mutual information estimation is bounded by linear probing accuracy, we also probe LLMs with mutual information to investigate the dynamics of trustworthiness during pre-training. We are the first to observe a similar two-phase phenomenon: fitting and compression~\citep{shwartz2017opening}. This research provides an initial exploration of trustworthiness modeling during LLM pre-training, seeking to unveil new insights and spur further developments in the field. We will make our code publicly accessible at \url{https://github.com/ChnQ/TracingLLM}.
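To make the two activation-level techniques named in the abstract more concrete, here is a minimal sketch of linear probing and steering-vector extraction. It is not the authors' implementation: the activations are synthetic Gaussian data standing in for hidden states from a checkpoint, the probe is a plain logistic regression trained by gradient descent, and the steering vector is assumed to be a simple difference of class means. All function names (`make_activations`, `train_linear_probe`, `steering_vector`, `steer`) are hypothetical.

```python
import numpy as np


def make_activations(n_per_class=200, dim=32, shift=3.0, seed=0):
    """Synthetic stand-in for hidden states: two classes separated along one direction."""
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    pos = rng.normal(size=(n_per_class, dim)) + shift * direction  # e.g. "trustworthy" inputs
    neg = rng.normal(size=(n_per_class, dim)) - shift * direction  # e.g. "untrustworthy" inputs
    X = np.vstack([pos, neg])
    y = np.concatenate([np.ones(n_per_class), np.zeros(n_per_class)])
    return X, y


def train_linear_probe(X, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe (weights w, bias b) with batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probability of class 1
        grad = p - y                             # gradient of the log-loss w.r.t. the logit
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(w, b, X, y):
    """Fraction of examples the linear probe classifies correctly."""
    pred = (X @ w + b) > 0
    return float((pred == (y == 1)).mean())


def steering_vector(X, y):
    """Assumed construction: mean positive activation minus mean negative activation."""
    return X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)


def steer(h, v, alpha=1.0):
    """Shift activations h along the steering vector v, scaled by alpha."""
    return h + alpha * v


if __name__ == "__main__":
    X, y = make_activations()
    idx = np.random.default_rng(1).permutation(len(y))
    train, test = idx[:300], idx[300:]
    w, b = train_linear_probe(X[train], y[train])
    print("held-out probe accuracy:", probe_accuracy(w, b, X[test], y[test]))
    v = steering_vector(X[train], y[train])
    steered = steer(X[y == 0], v)
    print("fraction of steered negatives read as positive:",
          float(((steered @ w + b) > 0).mean()))
```

In a real setting the rows of `X` would be hidden states collected from one pre-training checkpoint at a chosen layer, and the steered vector would be written back into the forward pass rather than merely re-scored by the probe.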


