Video Annotator: A framework for efficiently building video classifiers using vision-language models and active learning

High-quality and consistent annotations are fundamental to the successful development of robust machine learning models. Traditional data annotation methods are resource-intensive and inefficient, often leading to a reliance on third-party annotators who are not domain experts. Hard samples, which are usually the most informative for model training, tend to be difficult to label accurately and consistently without business context. These can arise unpredictably during the annotation process, requiring a variable number of iterations and rounds of feedback, leading to unforeseen expenses and time commitments to guarantee quality. We posit that more direct involvement of domain experts, using a human-in-the-loop system, can resolve many of these practical challenges. We propose a novel framework we call Video Annotator (VA) for annotating, managing, and iterating on video classification datasets. Our approach offers a new paradigm for an end-user-centered model development process, enhancing the efficiency, usability, and effectiveness of video classifiers. Uniquely, VA allows for a continuous annotation process, seamlessly integrating data collection and model training. We leverage the zero-shot capabilities of vision-language foundation models combined with active learning techniques, and demonstrate that VA enables the efficient creation of high-quality models. VA achieves a median 6.8-point improvement in Average Precision relative to the most competitive baseline across a wide-ranging assortment of tasks. We release a dataset with 153k labels across 56 video understanding tasks annotated by three professional video editors using VA, and also release code to replicate our experiments at: http://github.com/netflix/videoannotator.
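
The abstract combines three ideas that are easy to state concretely: zero-shot retrieval with a vision-language model, active learning to pick which clips to label next, and Average Precision as the evaluation metric. The following is a minimal sketch of each, not the authors' implementation. All function names are hypothetical, the embeddings are toy two-dimensional vectors standing in for real vision-language model outputs, and uncertainty sampling (choosing scores nearest 0.5) is assumed here as one common active learning strategy; the abstract does not say which strategy VA uses.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def zero_shot_rank(text_emb, clip_embs):
    """Rank clip ids by similarity to a text query embedding, best first.

    Assumes text and clips were embedded into a shared space by some
    vision-language model (not included in this sketch).
    """
    return sorted(
        clip_embs,
        key=lambda cid: cosine(text_emb, clip_embs[cid]),
        reverse=True,
    )


def most_uncertain(scores, labeled, k):
    """Pick k unlabeled clip ids whose classifier score is closest to 0.5."""
    pool = [cid for cid in scores if cid not in labeled]
    return sorted(pool, key=lambda cid: abs(scores[cid] - 0.5))[:k]


def average_precision(ranked_labels):
    """Average Precision of a ranking, given 0/1 relevance labels in rank order."""
    hits, total = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / hits if hits else 0.0


# Toy data: hypothetical embeddings and classifier scores.
query = [1.0, 0.0]
clips = {"a": [0.9, 0.1], "b": [0.0, 1.0], "c": [0.5, 0.5]}
ranking = zero_shot_rank(query, clips)

scores = {"a": 0.95, "b": 0.48, "c": 0.6, "d": 0.1}
to_label = most_uncertain(scores, labeled={"c"}, k=2)

ap = average_precision([1, 0, 1])
```

In a human-in-the-loop setting, the zero-shot ranking would seed the first batch of clips to annotate, and the uncertainty-based selection would propose later batches once a classifier has been trained on the labels collected so far.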

