Some accounts hold that children's acquisition of filler-gap dependencies depends on innate grammatical knowledge, while others argue that the distributional evidence available in child-directed speech suffices. The question has been difficult to resolve because the relevant input is hard to quantify at scale and at fine granularity. We present a system that identifies three core filler-gap constructions in spoken English corpora -- matrix wh-questions, embedded wh-questions, and relative clauses -- and labels the extraction site of each (subject, object, or adjunct). Our approach combines constituency and dependency parsing, leveraging their complementary strengths for construction classification and extraction-site identification. We validate the system on human-annotated data and find that it scores well across most categories. Applying the system to 57 English CHILDES corpora, we characterize children's filler-gap input and their filler-gap production trajectories over the course of development, including construction-specific frequencies and extraction-site asymmetries. The resulting fine-grained labels enable future work in both acquisition research and computational modeling, which we demonstrate with a case study that trains language models on filtered corpora.
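To make the idea of extraction-site identification from a dependency parse concrete, here is a minimal sketch. It is not the system described above: the `Token` type, the `WH_WORDS` set, the `SITE_BY_DEPREL` mapping, and the `extraction_site` function are all hypothetical names introduced for illustration, and the mapping from Universal Dependencies-style relation labels to subject, object, and adjunct is an assumed simplification. The parses are written by hand rather than produced by a parser.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    form: str    # surface word
    deprel: str  # UD-style dependency relation of this word to its head


# Assumption: a small closed set of English wh-words.
WH_WORDS = {"who", "whom", "what", "which", "where", "when", "why", "how"}

# Assumption: a coarse mapping from the wh-word's relation to an extraction site.
SITE_BY_DEPREL = {
    "nsubj": "subject",
    "nsubj:pass": "subject",
    "obj": "object",
    "iobj": "object",
    "obl": "adjunct",
    "advmod": "adjunct",
}


def extraction_site(tokens: List[Token]) -> Optional[str]:
    """Return the site label of the first wh-word, or None if none is found
    or its relation is not covered by the mapping."""
    for tok in tokens:
        if tok.form.lower() in WH_WORDS:
            return SITE_BY_DEPREL.get(tok.deprel)
    return None


# Hand-written parses for three matrix wh-questions.
who_saw_you = [Token("Who", "nsubj"), Token("saw", "root"), Token("you", "obj")]
what_did_you_see = [
    Token("What", "obj"), Token("did", "aux"),
    Token("you", "nsubj"), Token("see", "root"),
]
where_did_you_go = [
    Token("Where", "advmod"), Token("did", "aux"),
    Token("you", "nsubj"), Token("go", "root"),
]

for sentence in (who_saw_you, what_did_you_see, where_did_you_go):
    print(" ".join(t.form for t in sentence), "->", extraction_site(sentence))
```

A real pipeline would also need the construction type (matrix question, embedded question, or relative clause), which is where a constituency parse is plausibly more informative; this sketch covers only the dependency-based half of that division of labor.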