WristWorld: Generating Wrist-Views via 4D World Models for Robotic Manipulation

Wrist-view observations are crucial for vision-language-action (VLA) models because they capture fine-grained hand-object interactions that directly improve manipulation performance. Yet large-scale datasets rarely include such recordings, leaving a substantial gap between abundant anchor views and scarce wrist views. Existing world models cannot bridge this gap: they require a wrist-view first frame and therefore cannot generate wrist-view videos from anchor views alone. Meanwhile, recent visual geometry models such as VGGT provide geometric and cross-view priors that make it possible to handle extreme viewpoint shifts. Building on these priors, we propose WristWorld, the first 4D world model that generates wrist-view videos solely from anchor views. WristWorld operates in two stages: (i) Reconstruction, which extends VGGT and incorporates our Spatial Projection Consistency (SPC) Loss to estimate geometrically consistent wrist-view poses and 4D point clouds; (ii) Generation, which employs our video generation model to synthesize temporally coherent wrist-view videos from the reconstructed perspective. Experiments on Droid, Calvin, and Franka Panda demonstrate state-of-the-art video generation with superior spatial consistency. The generated views also improve VLA performance, raising the average task completion length on Calvin by 3.81% and closing 42.4% of the anchor-wrist view gap.
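The abstract does not define the SPC Loss or the way reconstructed geometry is rendered from the wrist pose. As a rough illustration of the underlying geometric step only, the sketch below projects a world-frame point cloud into a camera with a standard pinhole model and measures a generic mean reprojection error. The function names `project_points` and `reprojection_error`, the world-to-camera convention, and the intrinsics matrix are assumptions for illustration, not the authors' implementation or loss.

```python
import numpy as np


def project_points(points_world, R, t, K):
    """Project N x 3 world points into a pinhole camera.

    Assumed convention (hypothetical): X_cam = R @ X_world + t,
    and K is a 3 x 3 intrinsics matrix whose last row is [0, 0, 1].
    Returns (pixels of shape N x 2, boolean mask of points in front of the camera).
    """
    pts_cam = points_world @ R.T + t          # world frame -> camera frame
    z = pts_cam[:, 2]
    valid = z > 1e-6                          # keep points with positive depth
    uvw = pts_cam @ K.T                       # apply intrinsics
    safe_z = np.where(valid, z, 1.0)          # avoid division by zero for invalid points
    pixels = uvw[:, :2] / safe_z[:, None]     # perspective divide
    return pixels, valid


def reprojection_error(points_world, observed_px, R, t, K):
    """Mean pixel distance between projected and observed 2D locations.

    A generic consistency measure, shown as a stand-in; the paper's SPC Loss
    may be defined differently.
    """
    pixels, valid = project_points(points_world, R, t, K)
    if not valid.any():
        return float("inf")
    diff = pixels[valid] - observed_px[valid]
    return float(np.mean(np.linalg.norm(diff, axis=1)))


if __name__ == "__main__":
    K = np.array([[100.0, 0.0, 50.0],
                  [0.0, 100.0, 50.0],
                  [0.0, 0.0, 1.0]])
    R, t = np.eye(3), np.zeros(3)             # hypothetical wrist-camera pose
    cloud = np.array([[0.0, 0.0, 2.0], [1.0, 0.0, 2.0]])
    px, mask = project_points(cloud, R, t, K)
    print(px, mask)
    print(reprojection_error(cloud, px, R, t, K))
```

In a pipeline of the kind described, a pose estimate that yields a low error of this sort would be geometrically consistent with the anchor-view reconstruction; how WristWorld actually enforces this is specified in the paper, not here.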

