End-to-end automatic speech recognition (ASR) maps input speech directly to characters. This mapping becomes problematic when several different pronunciations must be mapped to a single character, or when a single pronunciation is shared among many different characters. Japanese ASR suffers the most from such many-to-one and one-to-many mapping problems because of Japanese kanji characters. To alleviate these problems, we introduce explicit interaction between characters and syllables using self-conditioned connectionist temporal classification (CTC), in which the upper layers are "self-conditioned" on the intermediate predictions from the lower layers. The proposed method uses character-level and syllable-level intermediate predictions as conditioning features to handle the mutual dependency between characters and syllables. Experimental results on the Corpus of Spontaneous Japanese show that the proposed method outperforms the conventional multi-task and self-conditioned CTC methods.
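The conditioning mechanism described above can be illustrated with a toy forward pass. This is a minimal sketch under stated assumptions, not the architecture evaluated in this work: the class name `SelfConditionedToyEncoder`, the `schedule` argument, the layer at which each unit is predicted, and the tanh residual block standing in for a real encoder layer are all hypothetical. It only shows the idea that an intermediate posterior over characters or syllables is projected back to the hidden dimension and added to the features seen by the upper layers.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(v):
    """Numerically stable softmax over a list of scores."""
    m = max(v)
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class SelfConditionedToyEncoder:
    """Toy encoder with character- and syllable-level self-conditioning.

    Hypothetical illustration: `schedule` maps a layer index to the unit
    ("char" or "syllable") whose intermediate prediction is fed back there.
    """

    def __init__(self, dim, n_layers, vocab_sizes, schedule, seed=0):
        rng = random.Random(seed)
        self.layers = [rand_matrix(dim, dim, rng) for _ in range(n_layers)]
        # One hidden->vocab and one vocab->hidden projection per unit type.
        self.to_vocab = {u: rand_matrix(v, dim, rng) for u, v in vocab_sizes.items()}
        self.to_hidden = {u: rand_matrix(dim, v, rng) for u, v in vocab_sizes.items()}
        self.schedule = schedule

    def forward(self, frames):
        intermediates = []
        for i, W in enumerate(self.layers):
            # Stand-in for a real encoder block: residual connection plus tanh.
            frames = [[h + math.tanh(a) for h, a in zip(x, matvec(W, x))]
                      for x in frames]
            unit = self.schedule.get(i)
            if unit is not None:
                # Intermediate frame-wise posterior over the chosen unit set.
                post = [softmax(matvec(self.to_vocab[unit], x)) for x in frames]
                intermediates.append((i, unit, post))
                # Self-conditioning: add the projected posterior to the features.
                frames = [[h + c for h, c in zip(x, matvec(self.to_hidden[unit], p))]
                          for x, p in zip(frames, post)]
        # Final character-level posterior from the top layer.
        final = [softmax(matvec(self.to_vocab["char"], x)) for x in frames]
        return final, intermediates


encoder = SelfConditionedToyEncoder(
    dim=4,
    n_layers=4,
    vocab_sizes={"char": 6, "syllable": 5},
    schedule={0: "syllable", 1: "char", 2: "syllable"},  # assumed ordering
)
speech = [[0.1, -0.2, 0.3, 0.0], [0.5, 0.4, -0.1, 0.2], [0.0, 0.0, 0.1, -0.3]]
char_posteriors, intermediate_predictions = encoder.forward(speech)
```

In training, each intermediate posterior and the final posterior would each receive a CTC loss against the character or syllable transcript; that part is omitted here, and the weights are random rather than learned.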