Detecting firearms and accurately localizing the individuals carrying them in images or videos is of paramount importance in security, surveillance, and content customization. The task is challenging in complex environments because of background clutter and the diverse shapes of firearms. To address it, we propose a novel approach that leverages human-firearm interaction information, which provides valuable clues for localizing firearm carriers. Our approach incorporates an attention mechanism that distinguishes humans and firearms from the background by focusing on the relevant regions. Additionally, we introduce a saliency-driven locality-preserving constraint that learns essential features while preserving foreground information in the input image. By combining these components, our approach achieves strong results on a newly proposed dataset. To handle inputs of varying sizes, we pass paired human-firearm instances, with attention masks as channels, through a deep network that uses an adaptive average pooling layer for feature computation. We extensively evaluate our approach against existing human-object interaction detection methods, and it achieves AP=77.8\% compared with AP=63.1\% for the baseline approach. This demonstrates the effectiveness of attention mechanisms and saliency-driven locality preservation for accurate human-firearm interaction detection. Our findings contribute to advancing security and surveillance by enabling more efficient firearm localization and identification in diverse scenarios.
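
To make the variable-size input handling concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that the attention masks are binary box masks and that they are stacked with the image channels; both are illustrative choices, and the names `box_mask`, `adaptive_avg_pool2d`, and `pair_descriptor` are hypothetical. The deep network is omitted, so pooling is applied directly to the stacked channels purely to show how adaptive average pooling maps any input size to a fixed-size output. The bin boundaries follow the common floor/ceil convention for adaptive pooling.

```python
def ceil_div(a, b):
    """Integer ceiling division for positive integers."""
    return -(-a // b)


def box_mask(height, width, box):
    """Binary mask (assumed form of an attention mask): 1.0 inside the box
    (x0, y0, x1, y1), with x1 and y1 exclusive, and 0.0 elsewhere."""
    x0, y0, x1, y1 = box
    return [[1.0 if x0 <= x < x1 and y0 <= y < y1 else 0.0
             for x in range(width)]
            for y in range(height)]


def adaptive_avg_pool2d(channel, out_h, out_w):
    """Average a 2D grid of any size down (or up) to exactly out_h x out_w.

    Output cell (i, j) averages rows floor(i*H/out_h) .. ceil((i+1)*H/out_h)
    and the analogous column range, so every bin is non-empty.
    """
    in_h, in_w = len(channel), len(channel[0])
    pooled = []
    for i in range(out_h):
        r0, r1 = (i * in_h) // out_h, ceil_div((i + 1) * in_h, out_h)
        row = []
        for j in range(out_w):
            c0, c1 = (j * in_w) // out_w, ceil_div((j + 1) * in_w, out_w)
            cells = [channel[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


def pair_descriptor(image_channels, human_box, firearm_box, out_size=(2, 2)):
    """Stack image channels with human and firearm masks, then pool each
    channel to a fixed grid. A real model would run a network before pooling."""
    height, width = len(image_channels[0]), len(image_channels[0][0])
    channels = list(image_channels) + [
        box_mask(height, width, human_box),
        box_mask(height, width, firearm_box),
    ]
    return [adaptive_avg_pool2d(ch, *out_size) for ch in channels]


# Two single-channel images of different sizes yield same-shaped descriptors.
small = [[[0.0] * 6 for _ in range(4)]]    # 1 channel, 4 x 6
large = [[[0.0] * 5 for _ in range(7)]]    # 1 channel, 7 x 5
desc_small = pair_descriptor(small, (0, 0, 3, 4), (2, 1, 4, 2))
desc_large = pair_descriptor(large, (1, 0, 4, 7), (3, 2, 5, 4))
```

Both descriptors contain three channels (one image channel plus two masks) of size 2 x 2, regardless of the input resolution, which is the property that lets a fixed-size classifier follow the pooling layer.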