As natural language corpora expand at an unprecedented rate, manual annotation remains a significant methodological bottleneck in corpus linguistic work. We address this challenge by presenting a scalable pipeline for automating grammatical annotation in large corpora using large language models (LLMs). Unlike previous supervised and iterative approaches, our method employs a four-phase workflow: prompt engineering, pre-hoc evaluation, automated batch processing, and post-hoc validation. We demonstrate the pipeline's accessibility and effectiveness through a diachronic case study of variation in the English evaluative 'consider' construction (consider X as/to be/Ø Y). We annotate 143,933 'consider' concordance lines from the Corpus of Historical American English (COHA) via the OpenAI API in under 60 hours, achieving over 98% accuracy on two sophisticated annotation procedures. A Bayesian multinomial GAM fitted to 44,527 true positives of the evaluative construction reveals previously undocumented genre-specific trajectories of change, enabling us to advance new hypotheses about the relationship between register formality and the competing pressures of morphosyntactic reduction and enhancement. Our results suggest that LLMs can perform a range of data preparation tasks at scale with minimal human intervention, opening up substantive research questions that were previously beyond practical reach, though implementation requires attention to costs, licensing, and ethical considerations.
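The four-phase workflow named above can be illustrated with a minimal sketch. This is not the authors' implementation: the label set, the prompt wording, the function names, and the rule-based `toy_annotator` (which stands in for an LLM API call so the example runs offline) are all assumptions made for illustration. Accuracy on a hand-checked sample is reported with a Wilson score interval, which is one common choice and not necessarily the one used in the study.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Label = str  # hypothetical label set: "as", "to be", "zero"


def build_prompt(line: str) -> str:
    """Phase 1 (prompt engineering): wrap one concordance line in fixed instructions."""
    return (
        "Classify the complement of 'consider' as one of: as, to be, zero.\n"
        f"Sentence: {line}"
    )


def toy_annotator(prompt: str) -> Label:
    """Stand-in for an LLM call (assumption): a crude keyword rule, for demo only."""
    sentence = prompt.rsplit("Sentence: ", 1)[1].lower()
    if " as " in sentence:
        return "as"
    if " to be " in sentence:
        return "to be"
    return "zero"


def accuracy(predicted: Sequence[Label], gold: Sequence[Label]) -> float:
    """Phases 2 and 4: share of predictions that match human gold labels."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and equally long")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def annotate_in_batches(
    lines: Sequence[str],
    annotate: Callable[[str], Label],
    batch_size: int = 100,
) -> List[Label]:
    """Phase 3 (automated batch processing): label every line, one batch at a time."""
    labels: List[Label] = []
    for start in range(0, len(lines), batch_size):
        batch = lines[start:start + batch_size]
        labels.extend(annotate(build_prompt(line)) for line in batch)
    return labels


def validation_sample(n_items: int, k: int, seed: int = 0) -> List[int]:
    """Phase 4 (post-hoc validation): reproducible random indices for manual checking."""
    return sorted(random.Random(seed).sample(range(n_items), min(k, n_items)))


if __name__ == "__main__":
    lines = [
        "They consider him as a friend.",
        "We consider it to be true.",
        "I consider her brilliant.",
    ]
    gold = ["as", "to be", "zero"]
    labels = annotate_in_batches(lines, toy_annotator, batch_size=2)
    correct = sum(p == g for p, g in zip(labels, gold))
    print(accuracy(labels, gold), wilson_interval(correct, len(gold)))
    print(validation_sample(len(lines), k=2))
```

In a real run, `toy_annotator` would be replaced by a function that sends the prompt to the chosen model and parses its reply; the remaining steps stay the same because the annotator is passed in as a plain callable.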