DVOS: Self-Supervised Dense-Pattern Video Object Segmentation

Video object segmentation approaches primarily rely on large-scale, pixel-accurate, human-annotated datasets for model development. In Dense Video Object Segmentation (DVOS) scenarios, each video frame contains hundreds of small, dense, and partially occluded objects. Manually annotating even a single frame therefore often takes hours, which hinders the development of DVOS for many applications. Furthermore, in videos with dense patterns, following a large number of objects that move in different directions poses additional challenges. To address these challenges, we propose a semi-self-supervised spatiotemporal approach for DVOS that uses a diffusion-based method through multi-task learning. By emulating the optical flow of real videos and simulating their motion, we developed a methodology to synthesize computationally annotated videos that can be used for training DVOS models. Model performance was further improved by utilizing weakly labeled (computationally generated but imprecise) data. To demonstrate the utility and efficacy of the proposed approach, we developed DVOS models for wheat head segmentation in handheld and drone-captured videos of wheat crops in fields at different locations and across growth stages from heading to maturity. Despite using only a few manually annotated video frames, the proposed approach yielded high-performing models, achieving a Dice score of 0.82 when tested on a drone-captured external test set. While we showed the efficacy of the proposed approach for wheat head segmentation, its application can be extended to other crops or to DVOS in other domains, such as crowd analysis or microscopic image analysis.
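For readers unfamiliar with the reported metric, the Dice score between a predicted mask A and a reference mask B is 2|A ∩ B| / (|A| + |B|). The following is a minimal sketch of that formula on binary masks, not the authors' evaluation code. The function name `dice_score`, the nested-list mask representation, and the convention that two empty masks score 1.0 are all assumptions made for illustration.

```python
from typing import Sequence


def dice_score(pred: Sequence[Sequence[int]],
               target: Sequence[Sequence[int]]) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks of equal shape.

    Hypothetical helper for illustration only; masks are nested sequences
    where any truthy value marks a foreground pixel.
    """
    if len(pred) != len(target) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, target)
    ):
        raise ValueError("masks must have the same shape")

    intersection = 0  # pixels that are foreground in both masks
    pred_area = 0     # foreground pixels in the prediction
    target_area = 0   # foreground pixels in the reference
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p, t = bool(p), bool(t)
            intersection += p and t
            pred_area += p
            target_area += t

    total = pred_area + target_area
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy 2x3 masks: 2 overlapping pixels, 3 foreground pixels in each mask.
pred_mask = [[1, 1, 0],
             [0, 1, 0]]
true_mask = [[1, 0, 0],
             [0, 1, 1]]
example_dice = dice_score(pred_mask, true_mask)  # 2*2 / (3+3)
```

For a video test set, a score like the one reported would typically be aggregated over frames. Whether that is done per frame and averaged, or over all pixels at once, is not stated in the abstract.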

