Accurate documentation of newborn resuscitation is essential for quality improvement and adherence to clinical guidelines, yet it remains underutilized in practice. Previous work using 3D-CNNs and Vision Transformers (ViTs) has shown promising results in detecting key activities in newborn resuscitation videos, but has also highlighted how difficult such fine-grained activities are to recognize. This work investigates whether generative AI (GenAI) methods can improve activity recognition from such videos. Specifically, we explore local vision-language models (VLMs), combined with large language models (LLMs), and compare them to a supervised TimeSformer baseline. Using a simulated dataset comprising 13.26 hours of newborn resuscitation videos, we evaluate several zero-shot VLM-based strategies as well as VLMs fine-tuned with classification heads, including fine-tuning with Low-Rank Adaptation (LoRA). Our results suggest that small (local) VLMs struggle with hallucinations, but when fine-tuned with LoRA they reach an F1 score of 0.91, surpassing the TimeSformer baseline's 0.70.