Manual annotation remains the gold standard for high-quality, densely labeled temporal video datasets, yet it is inherently time-consuming. Vision-language models can assist human annotators and expedite this process. We report on the impact of automatic pre-annotations, generated by a fine-tuned encoder, on a human-in-the-loop labeling workflow for video footage. Quantitative analysis of a single-iteration study involving 18 volunteers shows that our workflow reduced annotation time by 35% for the majority (72%) of participants. Beyond efficiency, we provide a rigorous framework for benchmarking AI-assisted workflows that quantifies the trade-off between algorithmic speed and the integrity of human verification.