The success of large language models and text-to-image models can be attributed to large-scale datasets. In 3D vision, models trained on large-scale synthetic and real-captured object data such as Objaverse and MVImgNet have made remarkable progress, but human-centric tasks have not seen comparable progress, partly because no large-scale human dataset exists. Existing datasets of high-fidelity 3D human capture remain mid-sized because acquiring high-quality 3D human data at scale is difficult. To bridge this gap, we present MVHumanNet, a dataset that comprises multi-view human action sequences of 4,500 human identities. Our work focuses on collecting human data with a large number of diverse identities and everyday clothing, using a multi-view human capture system that makes data collection easy to scale. The dataset contains 9,000 daily outfits, 60,000 motion sequences, and 645 million frames with extensive annotations, including human masks, camera parameters, 2D and 3D keypoints, SMPL/SMPLX parameters, and corresponding textual descriptions. To explore the potential of MVHumanNet in various 2D and 3D visual tasks, we conduct pilot studies on view-consistent action recognition, human NeRF reconstruction, text-driven view-unconstrained human image generation, and 2D view-unconstrained human image and 3D avatar generation. Extensive experiments demonstrate the performance improvements and effective applications enabled by the scale of MVHumanNet. MVHumanNet is currently the largest-scale 3D human dataset, and we hope that releasing the data with annotations will foster further innovation in 3D human-centric tasks at scale.