Counting trees: A treebank-driven exploration of syntactic variation in speech and writing across languages

This paper presents a novel treebank-driven approach to comparing syntactic structures in speech and writing using dependency-parsed corpora. Adopting a fully inductive, bottom-up method, we define syntactic structures as delexicalized dependency (sub)trees and extract them from spoken and written Universal Dependencies (UD) treebanks in two syntactically distinct languages, English and Slovenian. For each corpus, we analyze the size, diversity, and distribution of syntactic inventories, their overlap across modalities, and the structures most characteristic of speech. Results show that, across both languages, spoken corpora contain fewer and less diverse syntactic structures than their written counterparts, with consistent cross-linguistic preferences for certain structural types across modalities. Strikingly, the overlap between spoken and written syntactic inventories is very limited: most structures attested in speech do not occur in writing, pointing to modality-specific preferences in syntactic organization that reflect the distinct demands of real-time interaction and elaborated writing. This contrast is further supported by a keyness analysis of the most frequent speech-specific structures, which highlights patterns associated with interactivity, context-grounding, and economy of expression. We argue that this scalable, language-independent framework offers a useful general method for systematically studying syntactic variation across corpora, laying the groundwork for more comprehensive data-driven theories of grammar in use.
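To make the pipeline described above more concrete, the following is a minimal sketch of how delexicalized dependency subtrees might be extracted from CoNLL-U data and compared across two corpora. It is not the authors' implementation. The function names, the bracketed serialization (word order preserved, head marked with `*`), the restriction to subtrees with at least one dependent, and the use of the log-likelihood statistic for keyness are all assumptions made for illustration. The toy sentences are invented and are not taken from any UD treebank.

```python
import math
from collections import Counter


def read_sentences(conllu_text):
    """Yield sentences as lists of (id, upos, head, deprel) tuples."""
    sent = []
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line:
            if sent:
                yield sent
                sent = []
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        sent.append((int(cols[0]), cols[3], int(cols[6]), cols[7]))
    if sent:
        yield sent


def delexicalized_subtrees(sent):
    """Serialize the full subtree under every token that has dependents.

    Word forms and lemmas are dropped; only UPOS tags and dependency
    relations remain. Assumed notation: the head of a group carries '*'.
    """
    by_id = {tok[0]: tok for tok in sent}
    kids = {tok[0]: [] for tok in sent}
    for tok_id, _, head, _ in sent:
        if head in kids:
            kids[head].append(tok_id)

    def render(i, is_root):
        _, upos, _, deprel = by_id[i]
        label = upos if is_root else f"{deprel}:{upos}"
        if not kids[i]:
            return label
        parts = [label + "*" if j == i else render(j, False)
                 for j in sorted(kids[i] + [i])]
        return "(" + " ".join(parts) + ")"

    return [render(i, True) for i in kids if kids[i]]


def inventory(conllu_text):
    """Count every delexicalized subtree type in a corpus."""
    counts = Counter()
    for sent in read_sentences(conllu_text):
        counts.update(delexicalized_subtrees(sent))
    return counts


def overlap_share(inv_a, inv_b):
    """Share of tree types in inv_a that are also attested in inv_b."""
    if not inv_a:
        return 0.0
    return len(set(inv_a) & set(inv_b)) / len(inv_a)


def log_likelihood(a, b, size_a, size_b):
    """Log-likelihood (G2) keyness for frequencies a and b in two corpora."""
    total = size_a + size_b
    exp_a = size_a * (a + b) / total
    exp_b = size_b * (a + b) / total
    g2 = 0.0
    if a:
        g2 += a * math.log(a / exp_a)
    if b:
        g2 += b * math.log(b / exp_b)
    return 2 * g2


# Invented toy data (hypothetical, for illustration only).
SPOKEN_TOY = "\n".join([
    "1\tyou\tyou\tPRON\t_\t_\t2\tnsubj\t_\t_",
    "2\tknow\tknow\tVERB\t_\t_\t0\troot\t_\t_",
]) + "\n"
WRITTEN_TOY = "\n".join([
    "1\tI\tI\tPRON\t_\t_\t2\tnsubj\t_\t_",
    "2\tsee\tsee\tVERB\t_\t_\t0\troot\t_\t_",
    "3\tthe\tthe\tDET\t_\t_\t4\tdet\t_\t_",
    "4\ttrees\ttree\tNOUN\t_\t_\t2\tobj\t_\t_",
]) + "\n"

if __name__ == "__main__":
    spoken, written = inventory(SPOKEN_TOY), inventory(WRITTEN_TOY)
    print(spoken, written, overlap_share(spoken, written))
```

In this sketch, inventory size corresponds to the number of keys in each `Counter`, and speech-specific structures could be ranked by applying `log_likelihood` to each tree's spoken and written frequencies. Other choices, such as capping tree size or ignoring word order, would yield different inventories.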
