Estimation of Psychosocial Work Environment Exposures Through Video Object Detection. Proof of Concept Using CCTV Footage

From arXiv. 11 pages, 9 figures. Presented at IWOAR, the 9th International Workshop on Sensor-Based Activity Recognition and Artificial Intelligence, September 26-27, Potsdam, Germany.

This paper examines the use of computer vision algorithms to estimate aspects of the psychosocial work environment from CCTV footage. We present a proof of concept for a methodology that detects and tracks people in video footage and characterizes interactions between customers and employees by estimating their poses and calculating the duration of their encounters. We propose a pipeline that combines existing object detection and tracking algorithms (YOLOv8 and DeepSORT) with a pose estimation algorithm (BlazePose) to estimate the number of customers and employees in the footage as well as the duration of their encounters. We use a simple rule-based approach to classify the interactions as positive, neutral or negative based on three criteria: distance, duration and pose. The proposed methodology is tested on a small dataset of CCTV footage. Although the data are limited, particularly with respect to the quality of the footage, we chose this case because it represents a typical setting in which the method could be applied. The results show that the object detection and tracking part of the pipeline performs reasonably well on the dataset, with high recall and reasonable accuracy. At this stage, pose estimation cannot yet fully identify the type of interaction, owing to difficulties in tracking employees in the footage. We conclude that the method is a promising alternative to self-reported measures of the psychosocial work environment and could be used in future studies to obtain external observations of the work environment.
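The encounter-duration and rule-based classification steps described in the abstract could be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation. It assumes that the detector and tracker have already produced per-frame centroids with a track ID and a role label. The `Detection` type, the `pose_score` summary, and every threshold are hypothetical, since the abstract does not state how the three criteria are measured or combined.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Detection:
    """One tracked person in one frame (hypothetical record layout)."""
    frame: int
    track_id: int
    role: str   # "employee" or "customer"; assumed to be known per track
    x: float    # centroid in pixels
    y: float


def encounter_durations(detections, fps, max_distance):
    """Return {(employee_id, customer_id): seconds spent within max_distance}."""
    # Group detections by frame so that only co-occurring people are compared.
    by_frame = {}
    for det in detections:
        by_frame.setdefault(det.frame, []).append(det)

    # Count the frames in which each employee-customer pair is close together.
    close_frames = {}
    for dets in by_frame.values():
        employees = [d for d in dets if d.role == "employee"]
        customers = [d for d in dets if d.role == "customer"]
        for emp in employees:
            for cust in customers:
                if hypot(emp.x - cust.x, emp.y - cust.y) <= max_distance:
                    key = (emp.track_id, cust.track_id)
                    close_frames[key] = close_frames.get(key, 0) + 1

    # Convert frame counts to seconds.
    return {pair: n / fps for pair, n in close_frames.items()}


def classify_interaction(duration_s, mean_distance, pose_score,
                         min_duration_s=5.0, max_distance=150.0,
                         pose_margin=0.25):
    """Label an encounter as positive, neutral or negative.

    pose_score is an assumed summary in [-1, 1] derived from pose landmarks.
    All thresholds are illustrative placeholders, not values from the paper.
    """
    # Brief or distant encounters are treated as neutral in this sketch.
    if duration_s < min_duration_s or mean_distance > max_distance:
        return "neutral"
    if pose_score > pose_margin:
        return "positive"
    if pose_score < -pose_margin:
        return "negative"
    return "neutral"


# Usage: one employee and one customer who stand close for 5 of 10 frames.
detections = [Detection(f, 1, "employee", 0.0, 0.0) for f in range(10)]
detections += [Detection(f, 2, "customer", 3.0, 4.0) for f in range(5)]
detections += [Detection(f, 2, "customer", 30.0, 40.0) for f in range(5, 10)]
durations = encounter_durations(detections, fps=5, max_distance=10.0)
```

Counting close frames rather than requiring one unbroken interval makes the duration estimate tolerant of short tracking dropouts, which matters given the difficulties in tracking employees that the abstract reports. A real pipeline would also need to handle track ID switches and to convert pixel distances into physical distances.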
