We introduce PhotoBot, a framework for fully automated photo acquisition based on an interplay between high-level human language guidance and a robot photographer. We propose to communicate photography suggestions to the user via reference images that are selected from a curated gallery. We leverage a visual language model (VLM) and an object detector to characterize the reference images via textual descriptions, and then use a large language model (LLM) to retrieve relevant reference images based on a user's language query through text-based reasoning. To establish correspondences between the reference image and the observed scene, we exploit pre-trained features from a vision transformer capable of capturing semantic similarity across marked appearance variations. Using these features, we compute suggested pose adjustments for an RGB-D camera by solving a perspective-n-point (PnP) problem. We demonstrate our approach using a manipulator equipped with a wrist camera. Our user studies show that photos taken by PhotoBot are often more aesthetically pleasing than those taken by users themselves, as measured by human feedback. We also show that PhotoBot can generalize to other reference sources, such as paintings.
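To make the pose-adjustment step concrete, the sketch below illustrates the core idea of solving for a camera pose from 2D-3D correspondences. It uses a simple Direct Linear Transform (DLT) in plain NumPy as a stand-in for a full PnP solver; the camera intrinsics, pose, and points are entirely synthetic and illustrative, and this is not the authors' actual implementation.

```python
import numpy as np

def dlt_pnp(points_3d, points_2d):
    """Estimate a 3x4 projection matrix from >= 6 2D-3D correspondences
    via the Direct Linear Transform (a simple stand-in for a PnP solver)."""
    rows = []
    for (X, Y, Z), (u, v) in zip(points_3d, points_2d):
        # Each correspondence contributes two linear constraints on P.
        rows.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z, -u])
        rows.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z, -v])
    A = np.asarray(rows)
    # The projection matrix (up to scale) is the right singular vector
    # associated with the smallest singular value of A.
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 4)

def project(P, points_3d):
    """Project 3D points through P and dehomogenize to pixel coordinates."""
    homog = np.hstack([points_3d, np.ones((len(points_3d), 1))])
    proj = (P @ homog.T).T
    return proj[:, :2] / proj[:, 2:3]

# Synthetic example: a known camera and six non-coplanar points
# (all numeric values are illustrative, not from the paper).
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([[0.1], [0.0], [2.0]])
P_true = K @ np.hstack([R, t])

pts_3d = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0],
                   [0, 0, 1], [1, 1, 0.5], [0.5, 1, 1]], dtype=float)
pts_2d = project(P_true, pts_3d)

P_est = dlt_pnp(pts_3d, pts_2d)
reproj_err = np.abs(project(P_est, pts_3d) - pts_2d).max()
```

With exact (noise-free) correspondences the recovered projection matrix reproduces the observed pixels almost exactly; in practice one would use a robust PnP solver (e.g. with RANSAC) on the semantic keypoint matches described above, and decompose the result into a rotation and translation to command the manipulator.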