Large language models can generate fluent answers that are unfaithful to the provided context, while many safeguards rely on external verification or a separate judge after generation. We introduce \emph{internal flow signatures} that audit decision formation from depthwise dynamics at a fixed inter-block monitoring boundary. The method stabilizes token-wise motion via bias-centered monitoring, then summarizes trajectories in compact \emph{moving} readout-aligned subspaces constructed from the top token and its close competitors within each depth window. Neighboring window frames are aligned by an orthogonal transport, yielding depth-comparable transported step lengths, turning angles, and subspace drift summaries that are invariant to within-window basis choices. A lightweight GRU validator trained on these signatures performs self-checking without modifying the base model. Beyond detection, the validator localizes a culprit depth event and enables a targeted refinement: the model rolls back to the culprit token and clamps an abnormal transported step at the identified block while preserving the orthogonal residual. The resulting pipeline provides actionable localization and low-overhead self-checking from internal decision dynamics. \emph{Code is available at} \texttt{github.com/EavnJeong/Internal-Flow-Signatures-for-Self-Checking-and-Refinement-in-LLMs}.
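As an illustration only (the paper's exact construction is not reproduced here; the function names, dimensions, and the QR-based stand-in for the readout-aligned bases are assumptions), the alignment of neighboring window frames by an orthogonal transport, and the resulting transported step length, turning angle, and subspace drift, can be sketched via the orthogonal Procrustes problem:

```python
import numpy as np

def orthonormal_basis(A):
    """Orthonormalize the columns of A (QR); stands in for a window's
    readout-aligned basis built from the top token and its competitors."""
    Q, _ = np.linalg.qr(A)
    return Q

def orthogonal_transport(Q_prev, Q_next):
    """Orthogonal Procrustes: R = argmin_{R orthogonal} ||Q_prev @ R - Q_next||_F,
    solved in closed form from the SVD of Q_prev.T @ Q_next."""
    U, _, Vt = np.linalg.svd(Q_prev.T @ Q_next)
    return U @ Vt

def flow_signature(h_prev, h_next, Q_prev, Q_next):
    """Depth-comparable signature between two windows: transported step length,
    turning angle, and subspace drift (all invariant to within-window bases)."""
    R = orthogonal_transport(Q_prev, Q_next)
    c_prev = R.T @ (Q_prev.T @ h_prev)   # previous coords, transported into the next frame
    c_next = Q_next.T @ h_next
    step_len = float(np.linalg.norm(c_next - c_prev))
    cos = c_prev @ c_next / (np.linalg.norm(c_prev) * np.linalg.norm(c_next) + 1e-12)
    angle = float(np.arccos(np.clip(cos, -1.0, 1.0)))
    # subspace drift via the projector difference, which depends only on the
    # subspaces themselves, not on the particular bases chosen for them
    drift = float(np.linalg.norm(Q_prev @ Q_prev.T - Q_next @ Q_next.T) / np.sqrt(2))
    return step_len, angle, drift
```

The basis invariance claimed in the abstract follows because rotating a window basis `Q -> Q @ O` (orthogonal `O`) changes the Procrustes solution to `O.T @ R`, so the transported coordinates, and hence all three summaries, are unchanged.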
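The targeted refinement step, clamping an abnormal transported step at the identified block while preserving the orthogonal residual, can likewise be sketched. This is a minimal illustration under my own assumptions (the helper name `clamp_block_step` and the fixed threshold are hypothetical, not the paper's API):

```python
import numpy as np

def clamp_block_step(h_in, h_out, Q, max_step):
    """Rescale the in-subspace part of a block's residual update to at most
    max_step, leaving the component orthogonal to the monitored subspace intact.
    Q: (d, k) orthonormal basis of the monitored (readout-aligned) subspace."""
    delta = h_out - h_in
    step = Q.T @ delta                  # component inside the monitored subspace
    resid = delta - Q @ step            # orthogonal residual, preserved unchanged
    n = np.linalg.norm(step)
    if n > max_step:
        step = step * (max_step / n)    # clamp only the abnormal in-subspace step
    return h_in + Q @ step + resid
```

The design choice mirrored here is that only the motion the validator actually monitors is shrunk; everything the subspace does not capture passes through the block untouched.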