Language models have steadily grown to compress more world knowledge into their parameters, but the knowledge that can be pretrained into a model is upper-bounded by its parameter count. The capacity of Small Language Models (SLMs) in particular is limited, leading to factually incorrect generations. This problem is often mitigated by giving the SLM access to an outside source: the ability to query a larger model, documents, or a database. Under this setting, we study the fundamental question of \emph{which tokens an SLM can and should learn} during pretraining, versus \emph{which ones it should delegate} via a \texttt{<CALL>} token. We find that this is not simply a question of loss: although the loss is predictive of whether a predicted token mismatches the ground truth, some tokens are \emph{acceptable} in that they are truthful alternative continuations of a pretraining document, and should not trigger a \texttt{<CALL>} even if their loss is high. We find that a spaCy grammar parser can augment the loss signal to decide which tokens the SLM should learn to delegate in order to prevent factual errors, and which tokens are safe to learn and predict even under high loss. Based on this token-selection philosophy, we propose LaCy, a novel pretraining method. Our experiments demonstrate that LaCy models successfully learn which tokens to predict and where to delegate for help. This yields higher FactScores when generating in a cascade with a larger model, outperforming Rho- or LLM-judge-trained SLMs while being simpler and cheaper.
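To make the delegation idea concrete, the following is a minimal sketch of a loss-plus-parser decision rule. It is not the paper's actual implementation: the loss threshold, the sentence, the per-token losses, and the use of cheap lexical attributes from a blank spaCy pipeline (`like_num`, `is_title`) as a stand-in for the grammar parser's entity signal are all illustrative assumptions.

```python
import spacy

# Blank English pipeline: tokenizer plus lexical attributes only,
# no trained model download required.
nlp = spacy.blank("en")

LOSS_THRESHOLD = 2.0  # assumed hyperparameter, not from the paper

def looks_factual(token):
    # Approximate "factual" tokens as numbers and capitalized words
    # (excluding the sentence-initial capital).
    return token.like_num or (token.is_title and not token.is_sent_start)

def should_delegate(token, loss, threshold=LOSS_THRESHOLD):
    # High-loss factual tokens trigger <CALL>; high-loss non-factual
    # tokens are treated as acceptable alternatives and learned.
    return loss > threshold and looks_factual(token)

doc = nlp("The Eiffel Tower is 330 meters tall.")
# Hypothetical per-token SLM losses, aligned with the 8 tokens above.
losses = [0.1, 3.1, 2.8, 0.2, 4.0, 0.5, 0.3, 0.1]
decisions = [should_delegate(t, l) for t, l in zip(doc, losses)]
# "Eiffel", "Tower", and "330" are high-loss and entity-like, so they
# would be delegated; every other token is learned normally.
```

The key design point this illustrates is that loss alone is insufficient: a high-loss function word is learned anyway, while only high-loss entity-like tokens are routed through \texttt{<CALL>}.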