Unsupervised word segmentation of audio utterances is challenging because, in speech, there is typically no gap between words. In a preliminary experiment, we show that recent deep self-supervised features are very effective for word segmentation but require supervision to train the classification head. To extend their effectiveness to unsupervised word segmentation, we propose a pseudo-labeling strategy. Our approach relies on the observation that the temporal gradient magnitude of the embeddings (i.e., the distance between the embeddings of consecutive frames) is typically minimal far from word boundaries and higher near them. We apply a thresholding function to the temporal gradient magnitude to define a pseudo-label for wordness. We then train a linear classifier that maps the embedding of a single frame to its pseudo-label. Finally, we use the classifier score to predict whether a frame lies within a word or at a boundary. In an empirical investigation, our method, despite its simplicity and fast run time, significantly outperforms all previous methods on two datasets.
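
The pipeline described above can be illustrated with a minimal sketch in plain Python. Several details here are assumptions made for illustration, not taken from the abstract: the Euclidean distance as the frame-to-frame distance, a quantile-based threshold (the hypothetical `quantile` parameter), logistic regression trained by stochastic gradient descent as the linear classifier, and all function names. Frame embeddings are assumed to be precomputed by some self-supervised model and passed in as lists of floats.

```python
import math


def temporal_gradient_magnitude(frames):
    """Distance between each frame embedding and the next (length T - 1)."""
    return [math.dist(a, b) for a, b in zip(frames, frames[1:])]


def pseudo_labels(grad_mag, quantile=0.5):
    """Assumed thresholding rule: 1 ("wordness") if the gradient magnitude
    is at or below the chosen quantile, else 0 (likely near a boundary)."""
    ranked = sorted(grad_mag)
    threshold = ranked[int(quantile * (len(ranked) - 1))]
    return [1 if g <= threshold else 0 for g in grad_mag]


def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_classifier(frames, labels, epochs=200, lr=0.1):
    """Fit weights w and bias b mapping a single frame to its pseudo-label
    (logistic regression via SGD; an assumed choice of linear classifier)."""
    w = [0.0] * len(frames[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(frames, labels):
            err = _sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def wordness_scores(frames, w, b):
    """Classifier score per frame; low scores suggest a boundary."""
    return [_sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) for x in frames]


def wordness_pipeline(frames, quantile=0.5):
    """Gradient magnitude -> pseudo-labels -> linear classifier -> scores."""
    grad = temporal_gradient_magnitude(frames)
    labels = pseudo_labels(grad, quantile)
    # The last frame has no successor, so it gets no pseudo-label.
    w, b = train_linear_classifier(frames[:-1], labels)
    return wordness_scores(frames, w, b)


# Toy 2-D "embeddings": two tight clusters separated by one large jump.
toy_frames = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.1],
              [2.0, 2.0], [2.1, 2.0], [2.2, 2.1]]
toy_scores = wordness_pipeline(toy_frames)
```

In this sketch, the scores can be turned into boundary decisions by any simple rule, for example marking frames whose score falls below a cutoff; the abstract does not specify that step, so it is left out here.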