Language-Assisted Deep Learning for Autistic Behaviors Recognition

Correctly recognizing the behaviors of children with Autism Spectrum Disorder (ASD) is of vital importance for the diagnosis of autism and for timely early intervention. However, the observations and records made by parents of autistic children during treatment may not be accurate or objective. In such cases, automatic recognition systems based on computer vision and machine learning (in particular deep learning) can alleviate this issue to a large extent. Existing human action recognition models now achieve strong performance on challenging activity datasets, e.g., daily activities and sports activities. However, problem behaviors in children with ASD are very different from these general activities, and recognizing them via computer vision is less studied. In this paper, we first evaluate a strong baseline for action recognition, i.e., Video Swin Transformer, on two autism behavior datasets (SSBD and ESBD) and show that it achieves high accuracy and outperforms previous methods by a large margin, demonstrating the feasibility of vision-based problem behavior recognition. Moreover, we propose language-assisted training to further enhance action recognition performance. Specifically, we develop a two-branch multimodal deep learning framework that incorporates the "freely available" language description for each type of problem behavior. Experimental results demonstrate that the additional language supervision brings a clear performance boost on the autism problem behavior recognition task compared to using video information only (a 3.49% improvement on ESBD and 1.46% on SSBD).
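The abstract does not spell out how the video and language branches are combined, so the following is only a minimal sketch of one plausible form of language-assisted training, not the authors' implementation. It assumes a video encoder that outputs an embedding and class logits, and a text encoder that outputs one embedding per behavior description. The function names, the temperature, the loss weight, and all numeric values are hypothetical placeholders chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-likelihood of the true class."""
    return -math.log(probs[label])


def language_assisted_loss(video_emb, class_logits, text_embs, label,
                           temperature=0.1, weight=0.5):
    """Combine a video-only classification loss with a video-text alignment loss.

    Assumption: the language branch supervises the video embedding by pulling
    it toward the embedding of the matching behavior description.
    """
    # Branch 1: ordinary classification loss on the video logits.
    cls_loss = cross_entropy(softmax(class_logits), label)
    # Branch 2: similarity of the video embedding to each class description.
    sims = [cosine(video_emb, t) / temperature for t in text_embs]
    align_loss = cross_entropy(softmax(sims), label)
    return cls_loss + weight * align_loss


# Toy inputs standing in for encoder outputs (hypothetical values).
video_emb = [1.0, 0.0]
class_logits = [2.0, 0.5, 0.1]
text_embs = [[1.0, 0.1], [0.0, 1.0], [-1.0, 0.0]]  # one per behavior class

loss_correct = language_assisted_loss(video_emb, class_logits, text_embs, label=0)
loss_wrong = language_assisted_loss(video_emb, class_logits, text_embs, label=1)
print(loss_correct < loss_wrong)
```

In this sketch the text embeddings act only as extra supervision during training; at inference time the video branch alone could produce the prediction, which matches the abstract's framing of language as an auxiliary signal.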
