Conventional demographic inference methods have relied predominantly on supervised learning from accurately labeled data, yet they struggle to adapt to shifting social landscapes and diverse cultural contexts, which leads to narrow specialization and limited accuracy in applications. Recently, large multimodal models (LMMs) have shown transformative potential across a range of research tasks, such as visual comprehension and description. In this study, we explore the application of LMMs to demographic inference and introduce a benchmark for both quantitative and qualitative evaluation. Our findings indicate that LMMs offer advantages in zero-shot learning, interpretability, and handling uncurated 'in-the-wild' inputs, albeit with a propensity for off-target predictions. To improve LMM performance and make it comparable with supervised learning baselines, we propose a Chain-of-Thought augmented prompting approach, which effectively mitigates the off-target prediction issue.
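As a rough illustration of how Chain-of-Thought prompting can be combined with a check for off-target predictions, the sketch below builds a step-by-step prompt and accepts a model response only if its final answer falls within a fixed label set. This is a minimal sketch under stated assumptions, not the prompting approach proposed in this study: the label set `AGE_GROUPS`, the prompt wording, and the helpers `build_cot_prompt` and `parse_answer` are all hypothetical, and no particular LMM API is assumed.

```python
import re

# Hypothetical label set, chosen only for illustration.
AGE_GROUPS = ["0-17", "18-34", "35-54", "55+"]


def build_cot_prompt(attribute, labels):
    """Build a prompt that asks for step-by-step reasoning before the answer."""
    options = ", ".join(labels)
    return (
        f"Look at the person in the image and estimate their {attribute}.\n"
        "First, describe the visual cues you rely on, step by step.\n"
        "Then finish with one line of the form 'Answer: <label>', "
        f"where <label> is exactly one of: {options}."
    )


def parse_answer(response, labels):
    """Return the predicted label, or None if the response is off-target."""
    match = re.search(r"Answer:\s*(.+)", response, flags=re.IGNORECASE)
    if match is None:
        return None  # the model never produced a final answer line
    candidate = match.group(1).strip().strip(".").lower()
    for label in labels:
        if label.lower() == candidate:
            return label
    return None  # the answer is outside the allowed label set


prompt = build_cot_prompt("age group", AGE_GROUPS)
# A hand-written stand-in for a model response, not real model output.
example_response = "The person has grey hair and deep wrinkles.\nAnswer: 55+"
prediction = parse_answer(example_response, AGE_GROUPS)
```

Treating a `None` result as an off-target prediction gives a simple way to count such cases or to re-prompt the model; how off-target outputs are actually handled is a design choice that the abstract does not specify.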