Human infants acquire their verbal lexicon with minimal prior knowledge of language, relying on the statistical properties of phonological distributions and the co-occurrence of other sensory stimuli. This study proposes a novel, fully unsupervised learning method for discovering speech units that uses phonological information as a distributional cue and object information as a co-occurrence cue. The proposed method acquires words and phonemes from speech signals without supervision while simultaneously using object information from multiple modalities (visual, tactile, and auditory). It combines the nonparametric Bayesian double articulation analyzer (NPB-DAA), which discovers phonemes and words from phonological features, with multimodal latent Dirichlet allocation (MLDA), which categorizes multimodal information obtained from objects. In an experiment, the proposed method showed higher word discovery performance than baseline methods. Words that expressed the characteristics of objects (i.e., words corresponding to nouns and adjectives) were segmented accurately. Furthermore, we examined how learning performance is affected by differences in the importance of linguistic information. Increasing the weight of the word modality further improved performance relative to the fixed condition.