Ensuring the trustworthiness of large language models (LLMs) is crucial. Most existing studies examine fully pre-trained LLMs to understand and improve their trustworthiness. In this paper, to reveal the untapped potential of pre-training, we pioneer the exploration of LLMs' trustworthiness during the pre-training period itself, focusing on five key dimensions: reliability, privacy, toxicity, fairness, and robustness. First, we apply linear probing to LLMs. The high probing accuracy suggests that \textit{LLMs in early pre-training can already distinguish concepts in each trustworthiness dimension}. Next, to further uncover the hidden possibilities of pre-training, we extract steering vectors from an LLM's pre-training checkpoints and use them to enhance the LLM's trustworthiness. Finally, inspired by the result of~\citet{choi2023understanding} that mutual information estimation is bounded by linear probing accuracy, we also probe LLMs with mutual information to investigate the dynamics of trustworthiness during pre-training. We are the first to observe a two-phase phenomenon in this setting, fitting followed by compression, similar to prior observations~\citep{shwartz2017opening}. This research provides an initial exploration of trustworthiness modeling during LLM pre-training, seeking to unveil new insights and spur further developments in the field. We will make our code publicly accessible at \url{https://github.com/ChnQ/TracingLLM}.
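
For concreteness, the probing and steering steps above can be sketched as follows. The notation here is illustrative and assumed rather than taken from the method itself: $h_t(x)$ denotes a hidden representation of input $x$ at pre-training checkpoint $t$, $y \in \{0,1\}$ a binary concept label within one trustworthiness dimension, $\sigma$ the logistic function, $D^{+}$ and $D^{-}$ sets of positively and negatively labeled inputs, and $\alpha$ a scalar steering strength. A linear probe with parameters $(w_t, b_t)$ would then predict
\begin{equation}
  \hat{y}_t(x) = \sigma\bigl(w_t^{\top} h_t(x) + b_t\bigr),
\end{equation}
and one common way to build a steering vector, assumed here only as an example, is the difference of class-mean representations, added to the hidden state at inference time:
\begin{equation}
  v_t = \frac{1}{|D^{+}|} \sum_{x \in D^{+}} h_t(x) - \frac{1}{|D^{-}|} \sum_{x \in D^{-}} h_t(x),
  \qquad
  h'(x) = h(x) + \alpha\, v_t .
\end{equation}
Under this sketch, high accuracy of $\hat{y}_t$ at an early checkpoint $t$ would indicate that the concept is already linearly separable in the representation at that point in pre-training.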