PuzzleAvatar: Assembling 3D Avatars from Personal Albums

Generating personalized 3D avatars is crucial for AR/VR. However, recent text-to-3D methods that generate avatars for celebrities or fictional characters struggle with everyday people, and methods for faithful reconstruction typically require full-body images captured in controlled settings. What if a user could just upload their personal "OOTD" (Outfit Of The Day) photo collection and get a faithful avatar in return? The challenge is that such casual photo collections contain diverse poses, challenging viewpoints, cropped views, and occlusions (albeit with a consistent outfit, accessories, and hairstyle). We address this novel "Album2Human" task by developing PuzzleAvatar, a model that generates a faithful 3D avatar (in a canonical pose) from a personal OOTD album while bypassing the challenging estimation of body and camera pose. To this end, we fine-tune a foundational vision-language model (VLM) on such photos, encoding the appearance, identity, garments, hairstyles, and accessories of a person into separate learned tokens and instilling these cues into the VLM. In effect, we exploit the learned tokens as "puzzle pieces" from which we assemble a faithful, personalized 3D avatar. Importantly, we can customize avatars by simply interchanging tokens. As a benchmark for this new task, we collect a new dataset, called PuzzleIOI, with 41 subjects in a total of nearly 1K OOTD configurations, captured in challenging partial photos with paired ground-truth 3D bodies. Evaluation shows that PuzzleAvatar not only achieves high reconstruction accuracy, outperforming TeCH and MVDreamBooth, but also exhibits unique scalability to album photos and strong robustness. Our model and data will be public.
