We present OAKINK2, a dataset of bimanual object manipulation tasks for complex daily activities. To organize complex tasks into a structured representation, OAKINK2 introduces three levels of abstraction: Affordance, Primitive Task, and Complex Task. OAKINK2 takes an object-centric perspective for decoding complex tasks, treating each one as a sequence of object affordance fulfillments. The first level, Affordance, outlines the functionalities that objects in the scene can afford. The second level, Primitive Task, describes the minimal interaction units through which humans interact with an object to achieve its affordance. The third level, Complex Task, illustrates how Primitive Tasks are composed and how they depend on one another. The OAKINK2 dataset provides multi-view image streams and precise pose annotations for the human body, hands, and various interacting objects. This extensive collection supports applications such as interaction reconstruction and motion synthesis. Based on the three-level abstraction of OAKINK2, we explore a task-oriented framework for Complex Task Completion (CTC). CTC aims to generate a sequence of bimanual manipulations that achieves the task objectives. Within the CTC framework, we employ Large Language Models (LLMs) to decompose complex task objectives into sequences of Primitive Tasks, and we develop a Motion Fulfillment Model that generates bimanual hand motion for each Primitive Task. The OAKINK2 dataset and models are available at https://oakink.net/v2.
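As an illustration of the three-level abstraction described above, the following minimal Python sketch models Affordances, Primitive Tasks, and a Complex Task whose Primitive Tasks are linked by dependencies. It is not the OAKINK2 toolkit API: every class, field, and example name here (`Affordance`, `PrimitiveTask`, `ComplexTask`, `depends_on`, the bottle-and-cup task) is hypothetical, and representing interdependence as a dependency graph resolved by topological sorting is an assumption made for this example.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class Affordance:
    """Level 1: a functionality that an object in the scene can afford."""
    object_name: str
    function: str


@dataclass(frozen=True)
class PrimitiveTask:
    """Level 2: a minimal interaction unit that fulfills one affordance."""
    name: str
    affordance: Affordance


@dataclass
class ComplexTask:
    """Level 3: Primitive Tasks plus the dependencies between them."""
    objective: str
    primitives: dict[str, PrimitiveTask] = field(default_factory=dict)
    # Maps a primitive's name to the names of primitives that must finish first.
    depends_on: dict[str, set[str]] = field(default_factory=dict)

    def add(self, task, after=()):
        """Register a primitive and the primitives it depends on."""
        self.primitives[task.name] = task
        self.depends_on[task.name] = set(after)

    def execution_order(self):
        """Return one ordering of primitives that respects all dependencies."""
        order = TopologicalSorter(self.depends_on).static_order()
        return [self.primitives[name] for name in order]


# Hypothetical example task: pour water from a capped bottle into a cup.
task = ComplexTask(objective="pour water from a bottle into a cup")
task.add(PrimitiveTask("uncap_bottle", Affordance("bottle_cap", "unscrew")))
task.add(
    PrimitiveTask("pour_water", Affordance("bottle", "pour out liquid")),
    after=["uncap_bottle"],
)
task.add(
    PrimitiveTask("recap_bottle", Affordance("bottle_cap", "screw on")),
    after=["pour_water"],
)

order = [t.name for t in task.execution_order()]
print(order)
```

In this sketch, `execution_order` plays the role of the decomposition output that a CTC-style pipeline would consume: each Primitive Task in the returned sequence would then be handed to a motion generator. The LLM-based decomposition and the Motion Fulfillment Model themselves are not modeled here.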