In this paper, we propose the Masked Space-Time Hash encoding (MSTH), a novel method for efficiently reconstructing dynamic 3D scenes from multi-view or monocular videos. Based on the observation that dynamic scenes often contain substantial static areas, which cause redundancy in storage and computation, MSTH represents a dynamic scene as a weighted combination of a 3D hash encoding and a 4D hash encoding. The weights of the two components are given by a learnable mask, which is guided by an uncertainty-based objective to reflect the spatial and temporal importance of each 3D position. With this design, our method reduces the hash collision rate by avoiding redundant queries and modifications on static areas, making it feasible to represent a large number of space-time voxels with small hash tables. Moreover, because it does not need to fit large numbers of temporally redundant features independently, our method is easier to optimize and converges rapidly, requiring only twenty minutes of training for a 300-frame dynamic scene. As a result, MSTH obtains consistently better results than previous methods with only 20 minutes of training time and 130 MB of memory storage. Code is available at https://github.com/masked-spacetime-hashing/msth
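The weighted combination of a 3D and a 4D hash encoding described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`hash_index`, `lookup`, `msth_features`), the single-resolution nearest-voxel lookup, the hashing constants, and the convention that the mask weights the static (3D) branch are all assumptions made for illustration. The uncertainty-based objective that guides the mask is not shown.

```python
import numpy as np

# Illustrative hashing constants (assumption): one multiplier per input dimension.
PRIMES = np.array([1, 2654435761, 805459861, 3674653429], dtype=np.uint64)


def hash_index(grid_coords, table_size):
    """XOR-of-products hash over integer grid coordinates of shape (N, D)."""
    coords = grid_coords.astype(np.uint64)
    h = np.zeros(coords.shape[0], dtype=np.uint64)
    for d in range(coords.shape[1]):
        h ^= coords[:, d] * PRIMES[d]  # unsigned arithmetic wraps around
    return (h % np.uint64(table_size)).astype(np.int64)


def lookup(table, points, resolution):
    """Nearest-voxel lookup (single level, no interpolation) for points in [0, 1)."""
    grid = np.floor(points * resolution).astype(np.int64)
    return table[hash_index(grid, table.shape[0])]


def msth_features(xyz, t, static_table, dynamic_table, mask_table, resolution=64):
    """Blend a static 3D encoding and a dynamic 4D encoding with a per-position mask."""
    xyzt = np.concatenate([xyz, t[:, None]], axis=1)
    f_static = lookup(static_table, xyz, resolution)     # depends on position only
    f_dynamic = lookup(dynamic_table, xyzt, resolution)  # depends on position and time
    # The mask is stored as logits and queried by 3D position only (assumption).
    m = 1.0 / (1.0 + np.exp(-lookup(mask_table, xyz, resolution)))
    return m * f_static + (1.0 - m) * f_dynamic


rng = np.random.default_rng(0)
static_table = rng.normal(size=(2**14, 2))
dynamic_table = rng.normal(size=(2**16, 2))
mask_table = np.full((2**14, 1), 50.0)  # large logits: every position treated as static

xyz = rng.random((5, 3))
f_t0 = msth_features(xyz, np.full(5, 0.1), static_table, dynamic_table, mask_table)
f_t1 = msth_features(xyz, np.full(5, 0.9), static_table, dynamic_table, mask_table)
print(np.allclose(f_t0, f_t1))
```

In this sketch, positions whose mask saturates toward the static branch return the same feature at every timestamp, so their contribution from the 4D table vanishes. This is one way to picture how static regions could stop competing for slots in the space-time table.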