Gaze to Insight: A Scalable AI Approach for Detecting Gaze Behaviours in Face-to-Face Collaborative Learning

Previous studies have shown that analysing gaze behaviours in collaborative learning can yield educationally meaningful information that students can use to reflect on their learning. Over the past decades, machine learning approaches have been developed to detect gaze behaviours automatically from video data. However, because these approaches typically require large amounts of labelled training data, human annotation remains necessary. Researchers have also questioned the cross-configuration robustness of these models, as training datasets often fail to cover the full range of situations encountered in educational contexts. To address these challenges, this study proposes a scalable artificial intelligence approach that leverages pretrained and foundation models to detect gaze behaviours automatically in face-to-face collaborative learning contexts without requiring human-annotated data. The approach uses a pretrained YOLO11 model for person tracking, YOLOE-26 with text-prompt capability for detecting education-related objects, and the Gaze-LLE model for gaze target prediction. The results indicate that the proposed approach achieves an F1-score of 0.829 in detecting students' gaze behaviours from video data, with strong performance for laptop-directed and peer-directed gaze but weaker performance for other gaze targets. Furthermore, compared with supervised machine learning approaches, the proposed method shows superior and more stable performance in complex contexts, indicating better cross-configuration robustness. The implications of this approach for supporting students' collaborative learning in real-world environments are also discussed.
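
The abstract names three components (person tracking, object detection, gaze target prediction) but does not say how their outputs are combined into a gaze label. As an illustration only, the sketch below shows one plausible final step: given bounding boxes from the tracker and the object detector and a predicted gaze point from the gaze model, it labels the gaze as peer-directed, object-directed, or other. The `Box` type, the `classify_gaze` function, the `person:<id>` label convention, and the "smallest containing box wins" rule are all assumptions made for this example, not the authors' method, and no real model is called.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical type)."""
    label: str  # e.g. "laptop", or "person:2" for a tracked student
    x1: float
    y1: float
    x2: float
    y2: float

    def contains(self, x: float, y: float) -> bool:
        return self.x1 <= x <= self.x2 and self.y1 <= y <= self.y2

    def area(self) -> float:
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


def classify_gaze(
    gaze_xy: Optional[Tuple[float, float]],
    viewer_label: str,
    boxes: List[Box],
) -> str:
    """Map a predicted gaze point to a coarse gaze-target label.

    Assumption: when several boxes contain the gaze point, the smallest one
    is taken as the target (a laptop in front of a peer wins over the peer).
    """
    if gaze_xy is None:  # the gaze model reported no in-frame target
        return "other"
    x, y = gaze_xy
    # Ignore the viewer's own box so a student cannot "look at" themselves.
    hits = [b for b in boxes if b.label != viewer_label and b.contains(x, y)]
    if not hits:
        return "other"
    target = min(hits, key=Box.area)
    return "peer" if target.label.startswith("person:") else target.label


# Toy frame: two tracked students and one laptop in front of student 2.
frame_boxes = [
    Box("person:1", 0, 0, 100, 200),
    Box("person:2", 300, 0, 400, 200),
    Box("laptop", 320, 120, 380, 180),
]
```

With these toy boxes, a gaze point from student 1 that lands inside the laptop box is labelled `laptop`, one that lands only on student 2 is labelled `peer`, and anything else falls back to `other`. Comparing such per-frame labels against annotated ground truth would then allow an F1-score to be computed for each gaze target.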
