Our work examines how large language models can be used for robotic planning and sampling in the context of automated photographic documentation. Specifically, we illustrate how to produce a photo-taking robot with an exceptional level of semantic awareness by leveraging recent advances in general-purpose language models (LMs) and vision-language models (VLMs). Given a high-level description of an event, we use an LM to generate a natural-language list of photo descriptions that one would expect a photographer to capture at the event. We then use a VLM to identify the best matches to these descriptions in the robot's video stream. Human evaluators consistently rate the photo portfolios generated by our method as more appropriate to the event than those generated by existing methods.
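As a rough illustration of the matching step only, the following minimal sketch assumes that each photo description and each video frame has already been mapped to a vector in a shared embedding space, and that the best match is the frame with the highest cosine similarity. Both assumptions are ours for the sake of the example; the abstract does not specify the scoring rule. The function names, the example descriptions, and the toy vectors are all hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_photos(description_vecs: Dict[str, Vector],
                  frame_vecs: List[Vector]) -> Dict[str, int]:
    """For each description, return the index of the best-scoring frame.

    Hypothetical helper: a real system would obtain these vectors from a
    VLM's text and image encoders; here they are supplied directly.
    """
    best = {}
    for text, d_vec in description_vecs.items():
        scores = [cosine(d_vec, f_vec) for f_vec in frame_vecs]
        best[text] = max(range(len(frame_vecs)), key=scores.__getitem__)
    return best


# Toy stand-ins for LM-generated descriptions and their embeddings (assumed).
descriptions = {
    "couple cutting the cake": [1.0, 0.0, 0.2],
    "guests dancing": [0.0, 1.0, 0.1],
}
# Toy stand-ins for embeddings of three frames from the video stream (assumed).
frames = [
    [0.1, 0.9, 0.0],
    [0.9, 0.1, 0.3],
    [0.5, 0.5, 0.5],
]

portfolio = select_photos(descriptions, frames)
print(portfolio)
```

With these toy vectors, the first description selects frame 1 and the second selects frame 0. In practice the frames would arrive as a stream, so a running best score per description would replace the full list.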