This paper describes recent developments in object-specific pose and shape prediction from single images. The main contribution is a new approach to camera pose prediction based on self-supervised learning of keypoints that correspond to locations on a category-specific deformable shape. We designed a network that generates a proxy ground-truth heatmap from a set of keypoints distributed over the category-specific mean shape, where each keypoint is represented by a unique color on a labeled texture. The proxy ground-truth heatmap is used to train a deep keypoint prediction network, which can then be used for online inference. The proposed approach to camera pose prediction shows significant improvements when compared with state-of-the-art methods. We use it to infer 3D objects online from the 2D image frames of video sequences. During training, the reconstruction model receives only a silhouette mask from a single frame of a video sequence at each training step, together with a category-specific mean object shape. We conducted experiments using three different datasets representing the bird category: the CUB [51] image dataset and the YouTubeVos and Davis video datasets. The network is trained on the CUB training set and tested on all three datasets; the online experiments are demonstrated on YouTubeVos and Davis [56] video sequences.
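As an illustration of the proxy ground-truth idea, the following minimal sketch projects keypoints defined on a mean shape into the image and renders one heatmap per keypoint. It is not the network described above: the weak-perspective camera restricted to a rotation about the y-axis, the Gaussian heatmap form, and all names (`project_weak_perspective`, `gaussian_heatmap`, `proxy_heatmaps`, `argmax_2d`, `sigma`) are assumptions made for this example only.

```python
import math


def project_weak_perspective(points, scale, yaw, translation):
    """Rotate 3D points about the y-axis, drop depth, then scale and shift.

    Assumption: a weak-perspective camera with a single rotation angle.
    """
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty = translation
    projected = []
    for x, y, z in points:
        x_rot = c * x + s * z  # x-coordinate after rotation about the y-axis
        projected.append((scale * x_rot + tx, scale * y + ty))
    return projected


def gaussian_heatmap(center, height, width, sigma=1.5):
    """Render an unnormalized 2D Gaussian centered on a pixel location (u, v)."""
    cu, cv = center
    return [
        [math.exp(-((u - cu) ** 2 + (v - cv) ** 2) / (2.0 * sigma ** 2))
         for u in range(width)]
        for v in range(height)
    ]


def proxy_heatmaps(points, scale, yaw, translation, height, width, sigma=1.5):
    """One heatmap per mean-shape keypoint, given hypothetical camera parameters."""
    pixels = project_weak_perspective(points, scale, yaw, translation)
    return [gaussian_heatmap(p, height, width, sigma) for p in pixels]


def argmax_2d(heatmap):
    """Return the (u, v) pixel with the highest response."""
    _, v, u = max(
        (value, v, u)
        for v, row in enumerate(heatmap)
        for u, value in enumerate(row)
    )
    return u, v


# Toy keypoints standing in for locations on a category-specific mean shape.
mean_shape_keypoints = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
heatmaps = proxy_heatmaps(mean_shape_keypoints, scale=4.0, yaw=0.0,
                          translation=(8.0, 8.0), height=16, width=16)
peaks = [argmax_2d(h) for h in heatmaps]
```

In this toy setup each heatmap peaks at the pixel where its keypoint projects, which is the property a keypoint prediction network trained on such targets would be asked to reproduce.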