As robots become increasingly integrated into construction workflows, their ability to interpret and respond to human behavior will be essential for enabling safe and effective collaboration. Vision-Language Models (VLMs) have emerged as a promising tool for visual understanding tasks and offer the potential to recognize human behaviors without extensive domain-specific training. This capability makes them particularly appealing in the construction domain, where labeled data is scarce and monitoring worker actions and emotional states is critical for safety and productivity. In this study, we evaluate the performance of three leading VLMs, GPT-4o, Florence 2, and LLaVA-1.5, in detecting construction worker actions and emotions from static site images. Using a curated dataset of 1,000 images annotated across ten action and ten emotion categories, we assess each model's outputs through standardized inference pipelines and multiple evaluation metrics. GPT-4o consistently achieved the highest scores on both tasks, with an average F1-score of 0.756 and accuracy of 0.799 in action recognition, and an F1-score of 0.712 and accuracy of 0.773 in emotion recognition. Florence 2 performed moderately, with F1-scores of 0.497 for action and 0.414 for emotion, while LLaVA-1.5 showed the lowest overall performance, with F1-scores of 0.466 for action and 0.461 for emotion. Confusion matrix analyses revealed that all models struggled to distinguish semantically close categories, such as collaborating in teams versus communicating with supervisors. While the results indicate that general-purpose VLMs can offer a baseline capability for human behavior recognition in construction environments, further improvements, such as domain adaptation, temporal modeling, or multimodal sensing, may be needed for real-world reliability.
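As an illustrative sketch of the evaluation metrics named above (accuracy, macro-averaged F1, and the confusion matrix over the ten categories), the snippet below computes them from scratch. The category names and toy labels are hypothetical placeholders, not the study's actual data or pipeline:

```python
# Sketch of the metrics used in the evaluation: a confusion matrix over
# a fixed label set, per-class F1, macro-averaged F1, and accuracy.
# Labels here are illustrative stand-ins, not the paper's taxonomy.

def confusion_matrix(y_true, y_pred, labels):
    """Rows = true class, columns = predicted class."""
    idx = {c: i for i, c in enumerate(labels)}
    m = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        m[idx[t]][idx[p]] += 1
    return m

def per_class_f1(m):
    """F1 for each class, read directly off the confusion matrix."""
    f1s = []
    for i in range(len(m)):
        tp = m[i][i]
        fp = sum(m[r][i] for r in range(len(m))) - tp  # column sum minus diagonal
        fn = sum(m[i]) - tp                            # row sum minus diagonal
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return f1s

def macro_f1(m):
    """Unweighted mean of per-class F1 scores."""
    scores = per_class_f1(m)
    return sum(scores) / len(scores)

def accuracy(m):
    """Fraction of predictions on the matrix diagonal."""
    total = sum(sum(row) for row in m)
    return sum(m[i][i] for i in range(len(m))) / total
```

Reading metrics off the confusion matrix, rather than from raw label lists, also makes the error analysis in the abstract natural: off-diagonal mass between two rows (e.g. "collaborating in teams" vs. "communicating with supervisors") directly quantifies how often semantically close categories are confused.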