This paper describes recent developments in object-specific pose and shape prediction from single images. The main contribution is a new approach to camera pose prediction based on self-supervised learning of keypoints that correspond to locations on a category-specific deformable shape. We design a network that generates a proxy ground-truth heatmap from a set of keypoints distributed over the category-specific mean shape, where each keypoint is represented by a unique color on a labeled texture. The proxy ground-truth heatmap is used to train a deep keypoint prediction network, which can then be used for online inference. The proposed approach to camera pose prediction shows significant improvements over state-of-the-art methods. We use it to infer 3D objects online from the 2D image frames of video sequences. At each training step, the reconstruction model receives only a silhouette mask from a single frame of a video sequence, together with a category-specific mean object shape. We conducted experiments on three datasets representing the bird category: the CUB [51] image dataset and the YouTubeVos and Davis [56] video datasets. The network is trained on the CUB training set and tested on all three datasets; the online experiments are demonstrated on YouTubeVos and Davis video sequences.