Large language models (LLMs) have shown potential as tools for scientific discovery. This has engendered growing interest in their use in humanistic disciplines, such as historical linguistics and literary studies. These fields often construct arguments on the basis of delineations like genre or, more inflexibly, time period. Although efforts have been made to restrict inference to specific domains via fine-tuning or model editing, we posit that the only true guarantee is domain-restricted pretraining, typically a data- and compute-expensive proposition. We show that efficient pretraining techniques can produce useful models over corpora too large for easy manual inspection but too small for "typical" LLM approaches. We employ a novel date-attribution pipeline to obtain a temporally segmented dataset of five 10-million-word slices. Over these corpus segments we train two corresponding five-model batteries: one efficiently pretrained from scratch and one parameter-efficiently finetuned from Llama3-8B. We find that the pretrained models are faster to train than the finetuned baselines and that they better respect the historical divisions of our corpus. Emphasizing speed and precision over ahistorical comprehensiveness enables a number of novel approaches to hypothesis discovery and testing in our target fields. Taking up diachronic linguistics as a testbed, we show that our method enables the detection of a diverse set of phenomena, including en masse lexical change, non-lexical (grammatical and morphological) change, and word sense introduction/obsolescence. We provide a ready-to-use pipeline that allows extension of our approach to other target fields with only minimal adaptation.