Implicit Self-supervised Language Representation for Spoken Language Diarization

In a code-switched (CS) scenario, spoken language diarization (LD) is essential as a preprocessing system. Further, implicit frameworks are preferable to explicit frameworks, as they can be easily adapted to low/zero-resource languages. Inspired by the speaker diarization (SD) literature, three frameworks based on (1) fixed segmentation, (2) change point-based segmentation, and (3) end-to-end (E2E) modeling are proposed to perform LD. An initial exploration with the synthetic TTSF-LD dataset shows that using the x-vector as an implicit language representation, with an appropriate analysis window length ($N$), can achieve performance on par with explicit LD. The best implicit LD performance, a Jaccard error rate (JER) of $6.38$, is achieved with the E2E framework. However, with the E2E framework, the implicit LD performance degrades to a JER of $60.4$ on the practical Microsoft CS (MSCS) dataset. The difference in performance is mostly due to the difference in the distribution of the monolingual segment duration of the secondary language between the MSCS and TTSF-LD datasets. Moreover, to avoid segment smoothing, the shorter monolingual segments call for a small value of $N$. At the same time, with a small $N$, the x-vector representation is unable to capture the required language discrimination because of acoustic similarity, as the same speaker speaks both languages. Therefore, to resolve this issue, a self-supervised implicit language representation is proposed in this study. Compared with the x-vector representation, the proposed representation provides a relative improvement of $63.9\%$ and achieves a JER of $21.8$ using the E2E framework.
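
The reported relative improvement can be checked from the two MSCS figures: $(60.4 - 21.8)/60.4 \approx 63.9\%$. The sketch below reproduces that arithmetic and illustrates, on toy data, why smoothing away a short secondary-language segment is costly under JER. It is a simplified illustration, not the scoring tool used in the study: it assumes frame-level labels, assumes the hypothesis already uses the same language labels as the reference (so no label mapping is needed), and the function names and the toy sequences `ref`, `hyp_smoothed`, and `hyp_sharp` are hypothetical.

```python
def jaccard_error_rate(ref, hyp):
    """Frame-level JER in percent, averaged over the languages in the reference.

    Assumption: ref and hyp are equal-length sequences of language labels,
    one label per frame, drawn from the same label set.
    """
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must have the same number of frames")
    errors = []
    for lang in sorted(set(ref)):
        # Frames where both agree on this language (intersection) and
        # frames where either assigns this language (union).
        inter = sum(1 for r, h in zip(ref, hyp) if r == lang and h == lang)
        union = sum(1 for r, h in zip(ref, hyp) if r == lang or h == lang)
        errors.append(1.0 - inter / union)
    return 100.0 * sum(errors) / len(errors)


def relative_improvement(baseline, proposed):
    """Relative reduction of an error metric, in percent."""
    return 100.0 * (baseline - proposed) / baseline


# Toy utterance: a long primary-language run and a short secondary-language run.
ref = ["L1"] * 8 + ["L2"] * 2
# A long analysis window smooths the short secondary segment away entirely.
hyp_smoothed = ["L1"] * 10
# A sharper hypothesis keeps the secondary segment, with a one-frame boundary error.
hyp_sharp = ["L1"] * 7 + ["L2"] * 3

if __name__ == "__main__":
    print(f"JER (smoothed): {jaccard_error_rate(ref, hyp_smoothed):.1f}")
    print(f"JER (sharp):    {jaccard_error_rate(ref, hyp_sharp):.1f}")
    print(f"Relative improvement: {relative_improvement(60.4, 21.8):.1f}%")
```

Because JER averages the per-language error, a missed secondary-language segment counts as a full error for that language however short the segment is, which is consistent with the sensitivity to segment duration and window length $N$ described above.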
