Cross-linguistically Consistent Semantic and Syntactic Annotation of Child-directed Speech

This paper proposes a methodology for constructing corpora of child-directed speech (CDS) paired with sentential logical forms, and uses this method to create two such corpora, in English and Hebrew. The approach enforces a cross-linguistically consistent representation, building on recent advances in dependency representation and semantic parsing. Specifically, the approach involves two steps. First, we annotate the corpora using the Universal Dependencies (UD) scheme for syntactic annotation, which has been developed to apply consistently to a wide variety of domains and typologically diverse languages. Next, we further annotate these data by applying an automatic method for transducing sentential logical forms (LFs) from UD structures. The UD and LF representations have complementary strengths: UD structures are language-neutral and support consistent and reliable annotation by multiple annotators, whereas LFs are neutral as to their syntactic derivation and transparently encode semantic relations. Using this approach, we provide syntactic and semantic annotation for two corpora from CHILDES: Brown's Adam corpus (English; we annotate ~80% of its child-directed utterances) and Berman's Hagar corpus (Hebrew; we annotate all of its child-directed utterances). We verify the quality of the UD annotation using an inter-annotator agreement study, and manually evaluate the transduced meaning representations. We then demonstrate the utility of the compiled corpora through (1) a longitudinal corpus study of the prevalence of different syntactic and semantic phenomena in the CDS, and (2) applying an existing computational model of language acquisition to the two corpora and briefly comparing the results across languages.
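To make the two-step idea concrete, the following is a minimal sketch of reading a predicate-argument structure off a UD tree. It is a toy illustration only and not the transduction method used in the paper: the `Token` class, the `toy_logical_form` function, and the flat `predicate(arg, ...)` output format are all hypothetical, and only the core UD relations `nsubj`, `obj`, and `iobj` are handled.

```python
from dataclasses import dataclass


@dataclass
class Token:
    idx: int     # 1-based position in the utterance
    lemma: str   # lemma of the word
    upos: str    # UD part-of-speech tag
    head: int    # index of the head token, 0 for the root
    deprel: str  # UD dependency relation to the head


def toy_logical_form(tokens):
    """Build a flat predicate-argument string from core UD relations.

    Hypothetical simplification: only direct nsubj/obj/iobj dependents
    of the root are used; all other dependents are ignored.
    """
    root = next(t for t in tokens if t.head == 0)
    args = []
    for rel in ("nsubj", "obj", "iobj"):
        for t in tokens:
            if t.head == root.idx and t.deprel == rel:
                args.append(t.lemma)
    return f"{root.lemma}({', '.join(args)})"


# Hand-written UD analysis of the utterance "you want the cookie".
utterance = [
    Token(1, "you", "PRON", 2, "nsubj"),
    Token(2, "want", "VERB", 0, "root"),
    Token(3, "the", "DET", 4, "det"),
    Token(4, "cookie", "NOUN", 2, "obj"),
]

print(toy_logical_form(utterance))  # prints: want(you, cookie)
```

The sketch reflects the complementary roles described above: the UD tree records language-neutral syntactic relations, while the derived form keeps only who did what to whom. A realistic transducer would also need to handle determiners, modifiers, questions, and embedded clauses, none of which this toy covers.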
