Ethosight: A Joint-Embedding Based System for Nuanced Perception Using Contextual Label Affinity Metric and Reasoning Based Iterative Learning

Traditional computer vision models often require extensive manual effort for data acquisition and validation, particularly when detecting subtle behavioral nuances or events. Real-world applications add a further difficulty: routine behaviors must be distinguished from potential risks, for example routine shopping from potential shoplifting. We present Ethosight, a novel zero-shot computer vision algorithm. Ethosight removes the need for pre-existing symbolic knowledge and starts from a clean slate defined by user requirements and the semantic knowledge of interest. Using localized label affinity calculations and a reasoning-guided iterative learning loop, Ethosight infers scene details and iteratively refines the label set. The reasoning mechanism can be a large language model such as GPT4, a symbolic reasoner such as OpenNARS, or a hybrid system. Ethosight also builds on a pre-trained multi-modal model, ImageBind, to generate accurate semantic knowledge of images within a few cycles, capturing both explicit and nuanced elements efficiently. We also introduce an implementation of Korzybski's "time-binding" concept in machines, which allows for generational learning and knowledge sharing across deployments. Our evaluations demonstrate Ethosight's efficacy across 40 complex use cases. It discerns new areas of interest and consistently generates high affinity scores within the top five labels from a set of a thousand. Tests conducted across diverse environments attest to Ethosight's robust performance. Detailed results and case studies in the main body of this paper and in an appendix point to a promising trajectory toward more adaptable and resilient computer vision models for detecting and extracting subtle and nuanced behaviors.
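To make the idea of label affinity scoring inside a reasoning-guided refinement loop more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that affinity can be approximated by a softmax over cosine similarities between an image embedding and label embeddings, which is an illustrative choice rather than the paper's metric. The hand-made three-dimensional vectors stand in for ImageBind embeddings, `toy_reasoner` stands in for a reasoner such as GPT4 or OpenNARS, and every function name and parameter here is hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def label_affinities(image_vec, label_vecs, temperature=0.1):
    """Assumed affinity: softmax over cosine similarities, one score per label."""
    sims = {label: cosine(image_vec, vec) for label, vec in label_vecs.items()}
    peak = max(sims.values())  # subtract the max for numerical stability
    exps = {label: math.exp((s - peak) / temperature) for label, s in sims.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

def top_k(affinities, k=5):
    """Return the k (label, score) pairs with the highest affinity."""
    return sorted(affinities.items(), key=lambda kv: kv[1], reverse=True)[:k]

def score(image_vec, labels, embed_label, k):
    """Embed every label and rank the label set against the image."""
    vecs = {label: embed_label(label) for label in labels}
    return top_k(label_affinities(image_vec, vecs), k)

def refine_labels(image_vec, labels, embed_label, propose_labels, cycles=3, k=5):
    """Score the labels, ask the reasoner for new candidates, and repeat."""
    labels = list(dict.fromkeys(labels))  # de-duplicate, keep order
    for _ in range(cycles):
        ranked = score(image_vec, labels, embed_label, k)
        proposed = propose_labels([label for label, _ in ranked])
        new = [p for p in proposed if p not in labels]
        if not new:  # the reasoner has nothing to add: stop early
            break
        labels.extend(new)
    # Rescore once more so labels added in the last cycle are included.
    return score(image_vec, labels, embed_label, k)

# Toy stand-ins (hypothetical): real embeddings would come from a joint-embedding model.
TOY_EMBEDDINGS = {
    "shopping": (1.0, 0.0, 0.0),
    "walking": (0.0, 1.0, 0.0),
    "shoplifting": (0.6, 0.0, 0.8),
    "concealing an item": (0.5, 0.0, 0.85),
}

def toy_reasoner(top_labels):
    """Hypothetical reasoner: suggests a more specific label for a risky scene."""
    return ["concealing an item"] if "shoplifting" in top_labels else []

image_vec = (0.5, 0.0, 0.87)
ranked = refine_labels(
    image_vec,
    ["shopping", "walking", "shoplifting"],
    embed_label=lambda label: TOY_EMBEDDINGS[label],
    propose_labels=toy_reasoner,
    k=2,
)
print(ranked)
```

In this toy run the reasoner contributes a label that was not in the initial set, and the loop stops once no further labels are proposed. The final rescoring step is a design choice of the sketch: it ensures that labels added in the last cycle still compete in the returned ranking.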
