In the field of spoken language processing, audio-visual speech processing is attracting increasing research attention. Key tasks in this area include lip reading, audio-visual speech recognition, and visual-to-speech synthesis. Although these tasks have achieved significant empirical success, theoretical analysis of audio-visual processing remains insufficient. This paper presents a quantitative analysis based on information theory, focusing on the information intersection between different modalities. Our results show that this analysis is valuable for understanding the difficulty of audio-visual processing tasks, as well as the benefits that modality integration can provide.
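The "information intersection" between modalities is commonly quantified as mutual information. As a minimal illustration of the idea (not the paper's actual method or data), the sketch below estimates the mutual information between paired discrete audio and visual observations from empirical counts; the function name and the toy phoneme/viseme labels are assumptions for demonstration only.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Estimate I(X;Y) in bits from paired discrete observations,
    using empirical (plug-in) probability estimates."""
    n = len(xs)
    px = Counter(xs)              # marginal counts of X
    py = Counter(ys)              # marginal counts of Y
    pxy = Counter(zip(xs, ys))    # joint counts of (X, Y)
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), counts simplified: c*n/(px*py)
        mi += p_joint * math.log2(c * n / (px[x] * py[y]))
    return mi

# Hypothetical toy data: phoneme labels (audio) vs. viseme classes (video).
audio = ['p', 'b', 'p', 'm', 'a', 'a', 'o', 'o']
video = ['bilabial', 'bilabial', 'bilabial', 'bilabial',
         'open', 'open', 'round', 'round']
print(mutual_information(audio, video))  # 1.5 bits
```

Here the viseme is a deterministic function of the phoneme, so the mutual information equals the viseme entropy (1.5 bits); in real audio-visual data the intersection is strictly smaller than either modality's entropy, which is what makes tasks like lip reading hard.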