We introduce PhotoBot, a framework for automated photo acquisition based on an interplay between high-level human language guidance and a robot photographer. We propose to communicate photography suggestions to the user via a reference picture that is retrieved from a curated gallery. We use a visual language model (VLM) and an object detector to characterize reference pictures via textual descriptions, and a large language model (LLM) to retrieve relevant reference pictures for a user's language query through text-based reasoning. To establish correspondences between the reference picture and the observed scene, we exploit pre-trained features from a vision transformer that capture semantic similarity across significantly varying images. Using these features, we compute pose adjustments for an RGB-D camera by solving a Perspective-n-Point (PnP) problem. We demonstrate our approach on a real-world manipulator equipped with a wrist camera. Our user studies show that photos taken by PhotoBot are often more aesthetically pleasing than those taken by users themselves, as measured by human feedback.
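As a rough illustration of how matched features could feed a PnP problem, the sketch below pairs feature descriptors from the reference picture and the observed scene by mutual nearest-neighbour cosine similarity, then lifts the matched scene pixels to 3D using depth and a pinhole camera model. The matching rule, the function names, and the intrinsics (`fx`, `fy`, `cx`, `cy`) are assumptions made for illustration and are not taken from the PhotoBot implementation; the resulting 3D-2D pairs would be passed to an off-the-shelf PnP solver, which is not shown here.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Pixel = Tuple[float, float]
Point3D = Tuple[float, float, float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two descriptors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mutual_nearest_neighbours(
    ref_feats: Sequence[Vector], cur_feats: Sequence[Vector]
) -> List[Tuple[int, int]]:
    """Return (ref_index, cur_index) pairs that are each other's best match."""
    if not ref_feats or not cur_feats:
        return []
    sim = [[cosine_similarity(r, c) for c in cur_feats] for r in ref_feats]
    best_for_ref = [
        max(range(len(cur_feats)), key=lambda j: sim[i][j])
        for i in range(len(ref_feats))
    ]
    best_for_cur = [
        max(range(len(ref_feats)), key=lambda i: sim[i][j])
        for j in range(len(cur_feats))
    ]
    return [(i, j) for i, j in enumerate(best_for_ref) if best_for_cur[j] == i]


def back_project(
    u: float, v: float, depth: float, fx: float, fy: float, cx: float, cy: float
) -> Point3D:
    """Pinhole back-projection of pixel (u, v) with metric depth into the camera frame."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def build_pnp_correspondences(
    matches: Sequence[Tuple[int, int]],
    ref_pixels: Sequence[Pixel],
    cur_pixels: Sequence[Pixel],
    cur_depths: Sequence[float],
    intrinsics: Tuple[float, float, float, float],
) -> List[Tuple[Point3D, Pixel]]:
    """Pair each 3D scene point with the reference pixel where it should appear.

    Matches whose depth reading is missing or invalid (<= 0) are skipped.
    """
    fx, fy, cx, cy = intrinsics
    pairs = []
    for ref_idx, cur_idx in matches:
        depth = cur_depths[cur_idx]
        if depth <= 0.0:
            continue
        u, v = cur_pixels[cur_idx]
        pairs.append((back_project(u, v, depth, fx, fy, cx, cy), ref_pixels[ref_idx]))
    return pairs
```

Under these assumptions, a PnP solver applied to the returned pairs would estimate the camera pose from which the observed 3D points project onto their matched locations in the reference picture. A robust variant (for example, with RANSAC) is a common choice when some matches are wrong.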