De gustibus non est disputandum ("there is no accounting for others' tastes") is a common Latin maxim describing how many solutions in life depend on personal preference. Many household tasks, in particular, can only be considered fully successful when they account for personal preferences such as the visual aesthetic of the scene. For example, a table could be set by arranging utensils according to the traditional rules of Western table-setting decorum, without considering the color, shape, or material of each object, but the result may not fully satisfy a given person. To this end, we present DegustaBot, an algorithm for visual preference learning that solves household multi-object rearrangement tasks according to personal preference. To do this, we use internet-scale pre-trained vision-and-language foundation models (VLMs) with novel zero-shot visual prompting techniques. To evaluate our method, we collect a large dataset of naturalistic personal preferences in a simulated table-setting task, and conduct a user study to develop two novel metrics for determining success based on personal preference. This is a challenging problem, and we find that 50% of our model's predictions are likely to be found acceptable by at least 20% of people.
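One way to read the headline result ("50% of our model's predictions are likely to be found acceptable by at least 20% of people") is as a two-level statistic: first compute, for each prediction, the share of people who accept it, then count how many predictions clear a threshold. The sketch below illustrates that reading only. The abstract does not define the two metrics, so the function name `fraction_broadly_acceptable`, the boolean acceptance matrix, and the toy data are all hypothetical and are not the authors' evaluation code.

```python
from typing import Sequence


def fraction_broadly_acceptable(
    acceptance: Sequence[Sequence[bool]],
    min_share: float = 0.2,
) -> float:
    """Return the fraction of predictions accepted by at least `min_share` of raters.

    acceptance[i][j] is True if (hypothetical) rater j finds prediction i acceptable.
    This is an illustrative reading of the abstract's claim, not the paper's metric.
    """
    if not acceptance or any(len(row) == 0 for row in acceptance):
        raise ValueError("need at least one prediction and one rater per prediction")

    broadly_accepted = 0
    for row in acceptance:
        share = sum(row) / len(row)  # share of raters who accept this prediction
        if share >= min_share:
            broadly_accepted += 1
    return broadly_accepted / len(acceptance)


# Toy data (invented for illustration): 4 predictions, 5 raters each.
toy = [
    [True, False, False, False, False],   # accepted by 20% of raters
    [False, False, False, False, False],  # accepted by 0%
    [True, True, False, False, False],    # accepted by 40%
    [False, False, False, False, False],  # accepted by 0%
]
result = fraction_broadly_acceptable(toy, min_share=0.2)
```

On this toy matrix, two of the four predictions reach the 20% threshold, which mirrors the shape of the reported figure without reproducing any real data.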