Vision-Language Models for Ergonomic Assessment of Manual Lifting Tasks: Estimating Horizontal and Vertical Hand Distances from RGB Video

Manual lifting tasks are a major contributor to work-related musculoskeletal disorders, and effective ergonomic risk assessment is essential for quantifying physical exposure and informing interventions. The Revised NIOSH Lifting Equation (RNLE) is a widely used risk assessment tool for lifting tasks that relies on six task variables, including the horizontal (H) and vertical (V) hand distances. These distances are typically obtained through manual measurement or specialized sensing systems, both of which are difficult to deploy in real-world environments. We evaluated the feasibility of using vision-language models (VLMs) to estimate H and V non-invasively from RGB video streams. Two multi-stage VLM-based pipelines were developed: a text-guided detection-only pipeline and a detection-plus-segmentation pipeline. Both pipelines used text-guided localization of task-relevant regions of interest, visual feature extraction from those regions, and transformer-based temporal regression to estimate H and V at the start and end of a lift. For a range of lifting tasks, estimation performance was evaluated using leave-one-subject-out validation across the two pipelines and seven camera view conditions. Results varied significantly across pipelines and camera view conditions, with the segmentation-based, multi-view pipeline consistently yielding the smallest errors: mean absolute errors of approximately 6-8 cm for H and 5-8 cm for V. Across camera view configurations, pixel-level segmentation reduced estimation error by approximately 20-30% for H and 35-40% for V relative to the detection-only pipeline. These findings support the feasibility of VLM-based pipelines for video-based estimation of RNLE distance parameters.
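To put centimeter-level errors in H and V in context, the following minimal sketch shows how they propagate into the RNLE horizontal and vertical multipliers. It uses the standard metric forms of the multipliers, HM = 25/H and VM = 1 - 0.003|V - 75|, with the usual range limits. The function names and the example distances and error magnitudes are illustrative assumptions and are not taken from the study.

```python
def horizontal_multiplier(h_cm: float) -> float:
    """Metric RNLE horizontal multiplier, HM = 25 / H.

    H below 25 cm is treated as 25 cm (HM = 1); H above 63 cm gives HM = 0.
    """
    if h_cm > 63.0:
        return 0.0
    return 25.0 / max(h_cm, 25.0)


def vertical_multiplier(v_cm: float) -> float:
    """Metric RNLE vertical multiplier, VM = 1 - 0.003 * |V - 75|.

    V outside the 0-175 cm range gives VM = 0.
    """
    if v_cm < 0.0 or v_cm > 175.0:
        return 0.0
    return 1.0 - 0.003 * abs(v_cm - 75.0)


def multiplier_range(multiplier, true_cm: float, error_cm: float):
    """Return (min, max) of a multiplier evaluated at the true distance
    and at the true distance shifted by -error_cm and +error_cm."""
    values = [
        multiplier(true_cm - error_cm),
        multiplier(true_cm),
        multiplier(true_cm + error_cm),
    ]
    return min(values), max(values)


if __name__ == "__main__":
    # Hypothetical lift: H = 40 cm, V = 100 cm, with assumed estimation
    # errors of 7 cm (H) and 6 cm (V). These values are for illustration only.
    hm_lo, hm_hi = multiplier_range(horizontal_multiplier, 40.0, 7.0)
    vm_lo, vm_hi = multiplier_range(vertical_multiplier, 100.0, 6.0)
    print(f"HM between {hm_lo:.3f} and {hm_hi:.3f}")
    print(f"VM between {vm_lo:.3f} and {vm_hi:.3f}")
```

Under these assumed values, HM is far more sensitive than VM because it varies inversely with H: a 7 cm error around H = 40 cm moves HM between roughly 0.53 and 0.76, whereas a 6 cm error around V = 100 cm changes VM by less than 0.02 in either direction.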
