We present Matrix3D, a unified model that performs several photogrammetry subtasks, including pose estimation, depth prediction, and novel view synthesis, all within a single model. Matrix3D utilizes a multi-modal diffusion transformer (DiT) to integrate transformations across several modalities, such as images, camera parameters, and depth maps. The key to Matrix3D's large-scale multi-modal training lies in its mask learning strategy, which enables full-modality model training even with partially complete data, such as bi-modal image-pose and image-depth pairs, thus significantly increasing the pool of available training data. Matrix3D demonstrates state-of-the-art performance on pose estimation and novel view synthesis tasks. Additionally, it offers fine-grained control through multi-round interactions, making it an innovative tool for 3D content creation. Project page: https://nju-3dv.github.io/projects/matrix3d.
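The mask learning strategy described above can be sketched as follows: each training sample may carry only a subset of modalities (e.g. an image-pose or image-depth pair), and missing modalities are replaced by a placeholder mask token so that a single model can still consume the sample. This is a minimal illustrative sketch under assumed names (`MODALITIES`, `assemble_tokens`, `mask_token`), not Matrix3D's actual implementation; in practice the mask token would be learned and the objective would be the diffusion loss on the masked/noised modalities.

```python
import numpy as np

# Illustrative sketch only: names and shapes are assumptions, not Matrix3D's API.
MODALITIES = ["image", "pose", "depth"]
DIM = 4  # toy token dimension

rng = np.random.default_rng(0)
mask_token = rng.standard_normal(DIM)  # would be a learned embedding in practice


def assemble_tokens(sample: dict) -> tuple[np.ndarray, np.ndarray]:
    """Stack one token per modality, masking out modalities absent from the sample.

    Returns (tokens, present_mask). During training, the model would be asked
    to predict the masked-out (or noised) modalities, so even bi-modal pairs
    contribute a full-modality training signal.
    """
    tokens, present = [], []
    for m in MODALITIES:
        if m in sample:
            tokens.append(sample[m])
            present.append(True)
        else:
            tokens.append(mask_token)  # placeholder for the missing modality
            present.append(False)
    return np.stack(tokens), np.array(present)


# A bi-modal image-pose pair: depth is absent and receives the mask token.
pair = {"image": np.ones(DIM), "pose": np.zeros(DIM)}
tokens, present = assemble_tokens(pair)
```

Because every sample is padded to the full modality set, image-pose datasets and image-depth datasets can be mixed freely in one training run, which is what enlarges the usable data pool.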