Unlabeled 3D objects present an opportunity to leverage pretrained vision-language models (VLMs) on a range of annotation tasks, from object semantics to physical properties. An accurate response must account for the full appearance of the object in 3D, the various ways of phrasing the question or prompt, and changes in any other factors that affect the response. We present a method that marginalizes over any factors varied across VLM queries, using the VLM's scores for sampled responses. First, we show that this probabilistic aggregation can outperform a language model (e.g., GPT4) used for summarization, for instance by avoiding hallucinations when responses contain contrasting details. Second, we show that aggregated annotations are useful for prompt-chaining: they help improve downstream VLM predictions (e.g., of an object's material when its type is specified as an auxiliary input in the prompt). Such auxiliary inputs allow ablating and measuring the contribution of visual reasoning over language-only reasoning. Using these evaluations, we show how VLMs can approach, without additional training or in-context learning, the quality of human-verified type and material annotations on the large-scale Objaverse dataset.
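To make the idea of score-based marginalization concrete, the following is a minimal sketch and not the exact procedure of this work. It assumes that each VLM query (one combination of rendered view and prompt phrasing) returns a few sampled responses with log-likelihood scores, that scores are normalized within each query, and that queries are weighted uniformly. The function name `aggregate_responses`, the data layout, and the numbers are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def aggregate_responses(queries):
    """Average per-query response distributions over all queries.

    `queries` is a list with one entry per VLM query (e.g., one view and
    prompt phrasing). Each entry is a list of (response, log_score) pairs.
    Assumption: log_score is the model's log-likelihood of that response.
    """
    totals = defaultdict(float)
    for samples in queries:
        unique = dict(samples)  # keep one score per distinct response
        max_log = max(unique.values())
        # Softmax over the sampled responses of this query (shifted for stability).
        weights = {r: math.exp(s - max_log) for r, s in unique.items()}
        z = sum(weights.values())
        for response, w in weights.items():
            # Uniform weight 1/len(queries) for each varied factor setting.
            totals[response] += (w / z) / len(queries)
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical scores for three queries about an object's type.
queries = [
    [("mug", math.log(0.7)), ("cup", math.log(0.2)), ("vase", math.log(0.1))],
    [("mug", math.log(0.5)), ("vase", math.log(0.5))],
    [("cup", math.log(0.6)), ("mug", math.log(0.4))],
]
marginal = aggregate_responses(queries)
best_response = next(iter(marginal))
print(best_response, marginal)
```

In this sketch a response that scores well under only one view or phrasing is down-weighted relative to one that is supported across queries, which is the intuition behind aggregating probabilistically rather than summarizing the raw text of the responses.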