Past work on unsupervised parsing has been limited to written text. In this paper, we present the first study on unsupervised spoken constituency parsing given unlabeled spoken sentences and unpaired textual data. The goal is to determine the hierarchical syntactic structure of spoken sentences in the form of constituency parse trees, such that each node is a span of audio that corresponds to a constituent. We compare two approaches: (1) cascading an unsupervised automatic speech recognition (ASR) model and an unsupervised parser to obtain parse trees on ASR transcripts, and (2) directly training an unsupervised parser on continuous word-level speech representations. The direct approach first splits utterances into sequences of word-level segments, then aggregates self-supervised speech representations within each segment to obtain segment embeddings. We find that separately training a parser on the unpaired text and applying it directly to ASR transcripts at inference time produces better results for unsupervised parsing. Additionally, our results suggest that accurate segmentation alone may be sufficient to parse spoken sentences accurately. Finally, we show that the direct approach may learn head-directionality correctly for both head-initial and head-final languages without any explicit inductive bias.
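
The segment-embedding step of the direct approach can be illustrated with a minimal sketch. It assumes that "aggregating" means mean pooling over the frames of each segment, which the abstract does not specify, and that word-level boundaries are already available as frame index pairs. The function name `segment_embeddings` and the toy inputs are hypothetical and stand in for real self-supervised speech features.

```python
import numpy as np


def segment_embeddings(frames, boundaries):
    """Mean-pool frame-level features inside each word-level segment.

    frames:     array of shape (T, D), one D-dimensional feature per frame
                (a stand-in for self-supervised speech representations).
    boundaries: list of (start, end) frame indices, end exclusive.
    Mean pooling is an assumption; other aggregation functions are possible.
    """
    frames = np.asarray(frames, dtype=float)
    pooled = []
    for start, end in boundaries:
        # Reject empty or out-of-range segments instead of averaging nothing.
        if not 0 <= start < end <= len(frames):
            raise ValueError(f"invalid segment ({start}, {end})")
        pooled.append(frames[start:end].mean(axis=0))
    return np.stack(pooled)


# Toy utterance: 6 frames with 2-dimensional features, split into 2 segments.
frames = np.arange(12, dtype=float).reshape(6, 2)
boundaries = [(0, 2), (2, 6)]
embeddings = segment_embeddings(frames, boundaries)
print(embeddings.shape)  # one embedding per segment
```

Under these assumptions, each utterance becomes a sequence of fixed-size vectors, one per segment, which a parser could consume in place of word embeddings.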