Inspired by the emergent behaviors of large language models, which exhibit generalized human-like intelligence, the research community is pursuing similar emergent capabilities in world models, with an emphasis on modeling the physical world. Within the scope of physical world models, objects are the fundamental primitives that constitute physical reality. From humans to computers, nearly everything we interact with is an object. These objects are rarely static; they are actionable entities whose varying states are determined by their intrinsic properties. While current methods approach object action states through either video generation or dynamic scene reconstruction, none explicitly models this basic element in a unified, principled way to build an actionable object representation. We propose WorldString, a neural architecture capable of modeling the state manifold of real-world objects by learning directly from point clouds or RGB-D video streams. Serving as a versatile digital twin, it acts as a foundational building block for physical world models; thus, we name it WorldString. Notably, its fully differentiable structure enables future integration with policy learning and neural dynamics.