Self-supervised techniques for learning speech representations have been shown to develop linguistic competence from exposure to speech alone, without the need for human labels. To fully realize the potential of these approaches and further our understanding of how infants learn language, simulations must closely emulate real-life situations by training on developmentally plausible corpora and benchmarking against appropriate test sets. To this end, we propose a language-acquisition-friendly benchmark that probes spoken language models at the lexical and syntactic levels, with the test sets at both levels compatible with the vocabulary typical of children's language experiences. This paper introduces the benchmark and summarizes a range of experiments showing its usefulness. In addition, we highlight two exciting challenges that need to be addressed for further progress: bridging the gap between text and speech, and bridging the gap between clean speech and in-the-wild speech.