Much linguistic research relies on annotated datasets of features extracted from text corpora, but the rapid growth of these corpora has made it impractical for linguists to annotate large data samples manually. In this paper, we present a replicable, supervised method that uses large language models to assist the linguist in grammatical annotation through prompt engineering, training, and evaluation. We introduce a methodological pipeline and apply it to a case study of formal variation in the English evaluative verb construction 'consider X (as) (to be) Y', using the large language model Claude 3.5 Sonnet and corpus data from Davies' NOW and EnTenTen21 (SketchEngine). Overall, the model reaches an accuracy of over 90% on our held-out test samples with only a small amount of training data, which supports using the method to annotate very large quantities of tokens of the construction in the future. We discuss how far our results generalise to a wider range of case studies of grammatical constructions and of grammatical variation and change, underlining the value of AI copilots as tools for future linguistic research.