Automated Speech Recognition shows superhuman performance on adult English speech across a range of benchmarks, but performs poorly on children's speech. This has long stood in the way of child-robot interaction. Recent advances in data-driven speech recognition, including the availability of Transformer architectures and unprecedented volumes of training data, might bring a breakthrough for child speech recognition and for social robot applications aimed at children. We revisit a 2017 study on child speech recognition and show that performance has indeed increased, with newcomer OpenAI Whisper doing markedly better than leading commercial cloud services. While transcription is not yet perfect, the best model recognises 60.3% of sentences correctly, barring small grammatical differences, with sub-second transcription time when running on a local GPU, showing potential for usable autonomous child-robot speech interactions.