Traditional computer vision models often require extensive manual effort for data acquisition, annotation, and validation, particularly when detecting subtle behavioral nuances or events. Distinguishing routine behavior from potential risk in real-world applications, such as telling routine shopping apart from potential shoplifting, complicates the process further. Moreover, these models may exhibit high false-positive rates and imprecise event detection when deployed in real-world scenarios that differ significantly from their training conditions. To overcome these hurdles, we present Ethosight, a novel zero-shot computer vision system. Ethosight starts from a clean slate defined by user requirements and semantic knowledge of interest. Using localized label affinity calculations and a reasoning-guided iterative learning loop, Ethosight infers scene details and iteratively refines the label set. The reasoning mechanism can be a large language model such as GPT4, a symbolic reasoner such as OpenNARS\cite{wang2013}\cite{wang2006}, or a hybrid system. Our evaluations demonstrate Ethosight's efficacy across 40 complex use cases spanning domains such as health, safety, and security. Detailed results and case studies in the main body of this paper and in an appendix point to a promising trajectory toward more adaptable and resilient computer vision models for detecting and extracting subtle, nuanced behaviors.
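To make the loop described above concrete, the following is a minimal sketch under stated assumptions, not Ethosight's actual implementation. It assumes that label affinity can be approximated by cosine similarity between an image embedding and text-label embeddings, and that the reasoner is any callable that proposes new labels given the current top-scoring ones. The names `refine_labels`, `embed_text`, `propose_labels`, and the toy embeddings are all hypothetical and chosen only for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def refine_labels(
    image_vec: Vector,
    seed_labels: Sequence[str],
    embed_text: Callable[[str], Vector],               # hypothetical text encoder
    propose_labels: Callable[[List[str]], List[str]],  # hypothetical reasoner (LLM, symbolic, hybrid)
    rounds: int = 3,
    top_k: int = 1,
) -> List[Tuple[str, float]]:
    """Score labels against the image, ask the reasoner for related labels, repeat."""
    labels = list(dict.fromkeys(seed_labels))  # de-duplicate, keep order

    def score(current: List[str]) -> Dict[str, float]:
        return {label: cosine(image_vec, embed_text(label)) for label in current}

    for _ in range(rounds):
        scores = score(labels)
        top = sorted(scores, key=scores.get, reverse=True)[:top_k]
        new = [label for label in propose_labels(top) if label not in labels]
        if not new:  # the reasoner has nothing to add: stop early
            break
        labels.extend(new)

    # Final scoring covers any labels added in the last round.
    final = score(labels)
    return sorted(final.items(), key=lambda kv: kv[1], reverse=True)


# Toy stand-ins (assumptions): a lookup table instead of a real encoder,
# and a fixed mapping instead of a real reasoner.
TOY_EMBEDDINGS: Dict[str, Vector] = {
    "shopping": (1.0, 0.1, 0.0),
    "security": (0.3, 0.8, 0.0),
    "concealing item": (0.2, 0.9, 0.1),
}
TOY_SUGGESTIONS: Dict[str, List[str]] = {"security": ["concealing item"]}


def toy_embed(label: str) -> Vector:
    return TOY_EMBEDDINGS[label]


def toy_reasoner(top_labels: List[str]) -> List[str]:
    return [s for label in top_labels for s in TOY_SUGGESTIONS.get(label, [])]


ranked = refine_labels((0.2, 0.9, 0.1), ["shopping", "security"], toy_embed, toy_reasoner)
```

In this toy run the seed set contains only "shopping" and "security"; the stand-in reasoner contributes "concealing item", which then ranks highest because its embedding matches the image vector. In a real system the encoder and the reasoner would be replaced by actual models, and the stopping rule and `top_k` would need tuning.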