Discovering the semantics of multimodal utterances is essential for understanding human language and enhancing human-machine interactions. Existing methods are limited in their ability to leverage nonverbal information to discern complex semantics in unsupervised scenarios. This paper introduces a novel unsupervised multimodal clustering method (UMC), making a pioneering contribution to this field. UMC constructs augmentation views for multimodal data and uses them in a pre-training stage to obtain well-initialized representations for subsequent clustering. An innovative strategy dynamically selects high-quality samples as guidance for representation learning, with sample quality gauged by the density of each sample's nearest neighbors. In addition, UMC automatically determines the optimal value of the top-$K$ parameter in each cluster to refine sample selection. Finally, both high- and low-quality samples are used to learn representations conducive to effective clustering. We build baselines on benchmark multimodal intent and dialogue act datasets. UMC improves clustering metrics by 2-6\% over state-of-the-art methods, marking the first successful endeavor in this domain. The complete code and data are available at https://github.com/thuiar/UMC.
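To make the density-based sample selection concrete, the following is a minimal sketch and not the released UMC implementation. It assumes that a sample's density is the inverse of its mean distance to its `k` nearest neighbors within the same cluster, and that a fixed fraction `keep_ratio` of each cluster is kept. The abstract does not give the exact density formula, and UMC determines the top-$K$ value per cluster automatically, which this sketch does not reproduce. The names `knn_density`, `select_high_quality`, `k`, and `keep_ratio` are hypothetical.

```python
import math


def knn_density(points, k):
    """Assumed density: inverse mean distance to the k nearest neighbors."""
    densities = []
    for i, p in enumerate(points):
        # Distances from point i to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        densities.append(1.0 / (sum(nearest) / len(nearest) + 1e-12))
    return densities


def select_high_quality(features, labels, k=2, keep_ratio=0.5):
    """Return sorted indices of the densest samples in each cluster.

    keep_ratio is a fixed, hypothetical stand-in for the per-cluster
    top-K value that UMC is described as choosing automatically.
    """
    selected = []
    for cluster in sorted(set(labels)):
        idx = [i for i, label in enumerate(labels) if label == cluster]
        if len(idx) < 2:
            # A singleton cluster has no neighbors; keep it as-is.
            selected.extend(idx)
            continue
        dens = knn_density([features[i] for i in idx], k)
        n_keep = max(1, math.ceil(keep_ratio * len(idx)))
        # Rank cluster members from densest to sparsest.
        ranked = sorted(range(len(idx)), key=lambda j: -dens[j])
        selected.extend(idx[j] for j in ranked[:n_keep])
    return sorted(selected)
```

Under these assumptions, a sample that lies far from the rest of its cluster receives a low density and is left out of the high-quality set, while samples in the tightly packed core are retained as guidance.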