In this paper, we present the solution of our team, HFUT-VUT, to the MultiMediate Grand Challenge 2023 at ACM Multimedia 2023. The solution covers three sub-challenges: bodily behavior recognition, eye contact detection, and next speaker prediction. We adopt Swin Transformer as the baseline and apply data augmentation strategies to all three tasks. Specifically, we crop the raw video to remove noise from other regions and use data augmentation to improve the generalization of the model. As a result, our solution achieves the best results on the corresponding test sets: a mean average precision of 0.6262 for bodily behavior recognition and an accuracy of 0.7771 for eye contact detection. In addition, our approach achieves a comparable result of 0.5281 in unweighted average recall for next speaker prediction.
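Since next speaker prediction is scored by unweighted average recall (UAR), a small worked example may help show how that metric differs from plain accuracy. The sketch below is illustrative only: the function name `unweighted_average_recall` and the toy labels are assumptions made for this example, not the challenge's evaluation code or data. It computes recall separately for each class and then averages the per-class values with equal weight.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall, with every class weighted equally.

    Hypothetical helper for illustration; not the official scorer.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1

    # Recall for class c = correct[c] / total[c]; average over classes
    # that occur in y_true, ignoring how frequent each class is.
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy, made-up labels: class 1 is rare and is never predicted.
y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
uar = unweighted_average_recall(y_true, y_pred)
print(accuracy, uar)
```

On these toy labels a predictor that always outputs the majority class reaches an accuracy of 0.8 but a UAR of only 0.5, which is why UAR is a common choice when classes are imbalanced.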