This work unveils the hidden link between phonemes and facial features. Prior studies of voice-face correlation typically rely on long voice recordings, for example to generate face images or to reconstruct 3D face meshes from voices. However, in situations such as voice-based crimes, the available voice evidence may be short and limited. Moreover, from a physiological perspective, each segment of speech, or phoneme, corresponds to a different type of airflow and facial movement. It is therefore valuable to uncover the link between phonemes and facial attributes. In this paper, we propose an analysis pipeline for exploring the voice-face relationship at a fine-grained level, i.e., phonemes versus facial anthropometric measurements (AMs). We build an estimator for each phoneme-AM pair and evaluate the correlation through hypothesis testing. Our results indicate that AMs are more predictable from vowels than from consonants, particularly plosives. We also observe that an AM is more predictable when it exhibits more movement during pronunciation of the phoneme. Our findings are consistent with those in physiology regarding this correlation and lay the groundwork for future research on speech-face multimodal learning.
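The per-pair procedure described above (one estimator per phoneme-AM pair, followed by a hypothesis test) can be illustrated with a minimal sketch. The abstract does not specify the estimator or the test, so everything here is an assumption made for illustration: a one-feature least-squares line stands in for the estimator, a permutation test on held-out error stands in for the hypothesis test, and the data are synthetic. The names `fit_line`, `heldout_mse`, and `permutation_p_value` are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit y ~ a*x + b for a single feature."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def heldout_mse(train, test):
    """Fit on the training pairs, return mean squared error on the test pairs."""
    a, b = fit_line([x for x, _ in train], [y for _, y in train])
    return statistics.fmean([(y - (a * x + b)) ** 2 for x, y in test])


def permutation_p_value(pairs, n_perm=200, seed=0):
    """P-value for H0: the AM is not predictable from the phoneme feature.

    Assumed test (not taken from the paper): shuffle the training targets,
    refit, and count how often the shuffled model does at least as well on
    the held-out pairs as the model trained on the real pairing.
    """
    rng = random.Random(seed)
    split = len(pairs) // 2
    train, test = pairs[:split], pairs[split:]
    observed = heldout_mse(train, test)
    xs = [x for x, _ in train]
    ys = [y for _, y in train]
    hits = 0
    for _ in range(n_perm):
        shuffled = ys[:]
        rng.shuffle(shuffled)
        if heldout_mse(list(zip(xs, shuffled)), test) <= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Synthetic stand-ins for one phoneme-AM pair: a scalar voice feature x
# and an AM value y. In "related" the AM depends on the feature; in
# "unrelated" the two are independent.
data_rng = random.Random(42)
related = []
for _ in range(200):
    x = data_rng.gauss(0, 1)
    related.append((x, 2.0 * x + data_rng.gauss(0, 0.5)))
unrelated = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]

p_related = permutation_p_value(related)
p_unrelated = permutation_p_value(unrelated)
print("related pair p-value:", p_related)
print("unrelated pair p-value:", p_unrelated)
```

In a setting like the one the abstract describes, such a test would be repeated for every phoneme-AM pair, so some correction for multiple comparisons would likely be needed; the sketch leaves that out.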