Estimating the Most Important Person (MIP) in any social event is a challenging problem, mainly due to contextual complexity and the scarcity of labeled data. Moreover, the causal cues underlying MIP estimation are subjective and diverse. To this end, we address the problem by annotating a large-scale `in-the-wild' dataset that captures human perceptions of the `Most Important Person (MIP)' in an image. The paper provides a thorough description of our proposed Multimodal Large Language Model (MLLM) based data annotation strategy, along with a detailed data quality analysis. Further, we comprehensively benchmark the proposed dataset using state-of-the-art MIP localization methods, observing a significant drop in performance compared to existing datasets. This drop indicates that existing MIP localization algorithms are not yet robust to `in-the-wild' situations. We believe the proposed dataset will play a vital role in building next-generation social situation understanding methods. The code and data are available at https://github.com/surbhimadan92/MIP-GAF.