Interactions involving children span a wide range of important domains, from learning to clinical diagnostic and therapeutic contexts. Automated analysis of such interactions is motivated by the need for accurate insights that scale and remain robust across diverse conditions. Identifying the speech segments that belong to the child is a critical step in such modeling. Conventional child-adult speaker classification typically relies on audio modeling alone, overlooking visual signals that convey speech articulation information, such as lip motion. Building on an audio-only child-adult speaker classification pipeline, we propose incorporating visual cues through active speaker detection and visual processing models. Our framework involves video pre-processing, utterance-level child-adult speaker detection, and late fusion of modality-specific predictions. Extensive experiments demonstrate that a visually aided pipeline improves both the accuracy and the robustness of the classification. We show relative improvements of 2.38% and 3.97% in macro F1 score when one face and two faces are visible, respectively.
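To make the late-fusion step concrete, the following is a minimal sketch of how utterance-level predictions from an audio model and a visual model could be combined. The abstract does not specify the fusion rule, so the weighted average used here, the function name `late_fuse`, the `audio_weight` parameter, and the fallback to audio-only predictions when no face is usable are all illustrative assumptions rather than the authors' method.

```python
from typing import List, Optional, Sequence, Tuple

# Class order is an assumption made for this sketch.
LABELS = ("child", "adult")


def late_fuse(
    audio_probs: Sequence[float],
    visual_probs: Optional[Sequence[float]] = None,
    audio_weight: float = 0.5,
) -> Tuple[str, List[float]]:
    """Fuse per-utterance class probabilities from two modalities.

    Hypothetical rule: a convex combination of the two probability
    vectors. If no visual prediction is available (for example, no
    face was detected), the audio prediction is used unchanged.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    if len(audio_probs) != len(LABELS):
        raise ValueError("audio_probs must have one entry per label")

    if visual_probs is None:
        fused = [float(p) for p in audio_probs]
    else:
        if len(visual_probs) != len(LABELS):
            raise ValueError("visual_probs must have one entry per label")
        fused = [
            audio_weight * a + (1.0 - audio_weight) * v
            for a, v in zip(audio_probs, visual_probs)
        ]

    total = sum(fused)
    if total <= 0.0:
        raise ValueError("probabilities must sum to a positive value")
    fused = [p / total for p in fused]  # renormalize to sum to 1

    best = max(range(len(LABELS)), key=lambda i: fused[i])
    return LABELS[best], fused


if __name__ == "__main__":
    # Audio alone leans "child"; the visual cue shifts the decision.
    print(late_fuse([0.6, 0.4]))
    print(late_fuse([0.6, 0.4], [0.2, 0.8]))
```

In this sketch the audio-only path is kept as the default, which mirrors the idea of adding visual cues on top of an existing audio pipeline; in practice the weight would be tuned on held-out data.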