Dependency trees have proven to be a very successful model for representing the syntactic structure of sentences in human languages. In these structures, vertices are words and edges connect syntactically dependent words. The tendency of these dependencies to be short has been demonstrated using random baselines for the sum of the lengths of the edges or its variants. A ubiquitous baseline is the expected sum in projective orderings (wherein edges do not cross and the root word of the sentence is not covered by any edge), which can be computed in $O(n)$ time. Here we focus on a weaker formal constraint, namely planarity, which forbids edge crossings but allows the root to be covered. In the theoretical domain, we present a characterization of planarity that, given a sentence, yields either the number of planar permutations or an efficient algorithm to generate uniformly random planar permutations of the words. We also show the relationship between the expected sum in planar arrangements and the expected sum in projective arrangements. In the domain of applications, we derive an $O(n)$-time algorithm to calculate the expected value of the sum of edge lengths in planar arrangements. We also apply this research to a parallel corpus and find that the gap between the actual dependency distance and the random baseline narrows as the strength of the formal constraint on dependency structures increases, suggesting that formal constraints absorb part of the dependency distance minimization effect. Our research paves the way for replicating past research on dependency distance minimization using random planar linearizations as the random baseline.
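The three baselines mentioned above can be illustrated by brute force on a very small tree. The following is a minimal sketch, not the $O(n)$-time algorithm the abstract refers to: it enumerates all $n!$ orderings, so it is only usable for tiny $n$. The function names (`edge_length_sum`, `is_planar`, `is_projective`, `expected_sums`) and the representation of a tree as a list of vertex pairs plus a root are assumptions made for this example.

```python
from fractions import Fraction
from itertools import permutations


def edge_length_sum(edges, pos):
    """Sum of |pos[u] - pos[v]| over all edges of the tree."""
    return sum(abs(pos[u] - pos[v]) for u, v in edges)


def is_planar(edges, pos):
    """True if no two edges cross when drawn above the sentence."""
    spans = [tuple(sorted((pos[u], pos[v]))) for u, v in edges]
    for i, (a, b) in enumerate(spans):
        for c, d in spans[i + 1:]:
            # Two edges cross when exactly one endpoint of one
            # lies strictly inside the span of the other.
            if a < c < b < d or c < a < d < b:
                return False
    return True


def is_projective(edges, pos, root):
    """Planar and, in addition, the root is not covered by any edge."""
    root_covered = any(
        min(pos[u], pos[v]) < pos[root] < max(pos[u], pos[v])
        for u, v in edges
    )
    return is_planar(edges, pos) and not root_covered


def expected_sums(edges, root, n):
    """Exact mean edge-length sum over all, planar, and projective orderings.

    Brute force over n! permutations (illustration only).
    """
    totals = {"unconstrained": [0, 0], "planar": [0, 0], "projective": [0, 0]}
    for perm in permutations(range(n)):
        # perm[p] is the vertex placed at position p.
        pos = {v: p for p, v in enumerate(perm)}
        d = edge_length_sum(edges, pos)
        kinds = ["unconstrained"]
        if is_planar(edges, pos):
            kinds.append("planar")
        if is_projective(edges, pos, root):
            kinds.append("projective")
        for kind in kinds:
            totals[kind][0] += d
            totals[kind][1] += 1
    return {kind: Fraction(s, c) for kind, (s, c) in totals.items()}


if __name__ == "__main__":
    # Path tree 0 - 1 - 2 rooted at vertex 0 (a hypothetical 3-word sentence).
    path_edges = [(0, 1), (1, 2)]
    for kind, value in expected_sums(path_edges, root=0, n=3).items():
        print(kind, value)
```

With three vertices no two edges can cross, so the planar and unconstrained averages coincide, and only the projective constraint (root not covered) removes orderings. Exact fractions are used so the averages can be compared without floating-point noise.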