Understanding group-level social interactions in public spaces is crucial for urban planning, informing the design of socially vibrant and inclusive environments. Detecting such interactions from images involves interpreting subtle visual cues such as interpersonal relations, proximity, and co-movement: semantically complex signals that go beyond the scope of traditional object detection. To address this challenge, we introduce a social group region detection task, which requires inferring and spatially grounding visual regions defined by abstract interpersonal relations. We propose MINGLE (Modeling INterpersonal Group-Level Engagement), a modular three-stage pipeline that integrates: (1) off-the-shelf human detection and depth estimation, (2) VLM-based reasoning to classify pairwise social affiliation, and (3) a lightweight spatial aggregation algorithm to localize socially connected groups. To support this task and encourage future research, we present a new dataset of 100K urban street-view images annotated with bounding boxes and labels for both individuals and socially interacting groups. The annotations combine human-created labels with outputs from the MINGLE pipeline, ensuring semantic richness and broad coverage of real-world scenarios.
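The third stage of the pipeline described above can be illustrated with a minimal sketch. Assuming stage 1 yields per-person bounding boxes and stage 2 yields pairwise affiliation decisions, one simple aggregation strategy is to treat affiliated pairs as edges, find connected components with union-find, and emit one enclosing box per multi-person component. The function names and input format here are illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch of a stage-3 spatial aggregation step (assumed
# design, not MINGLE's actual algorithm): merge pairwise-affiliated
# individuals into groups via union-find, then return one enclosing
# bounding box per group of two or more people.

def _find(parent, i):
    # Union-find root lookup with path halving.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def aggregate_groups(boxes, affiliated_pairs):
    """boxes: list of (x1, y1, x2, y2) person boxes from detection.
    affiliated_pairs: list of (i, j) index pairs the VLM judged
    socially affiliated. Returns enclosing boxes of social groups."""
    parent = list(range(len(boxes)))
    for i, j in affiliated_pairs:
        ri, rj = _find(parent, i), _find(parent, j)
        if ri != rj:
            parent[rj] = ri  # union the two components
    # Collect members of each connected component.
    groups = {}
    for idx in range(len(boxes)):
        groups.setdefault(_find(parent, idx), []).append(idx)
    # One enclosing box per group with at least two members.
    return [
        (min(boxes[i][0] for i in m), min(boxes[i][1] for i in m),
         max(boxes[i][2] for i in m), max(boxes[i][3] for i in m))
        for m in groups.values() if len(m) > 1
    ]
```

For example, three detected people with boxes `(0, 0, 10, 20)`, `(12, 0, 22, 20)`, and `(50, 0, 60, 20)`, where only the first two are judged affiliated, would yield a single group region `(0, 0, 22, 20)`; the isolated third person forms no group.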