Reconstructing dynamic assets from video data is central to many tasks in computer vision and graphics. Existing 4D reconstruction approaches are either limited to specific categories or rely on slow, optimization-based methods. Inspired by the recent Large Reconstruction Model (LRM), we present the Large Interpolation Model (LIM), a transformer-based feed-forward solution, guided by a novel causal consistency loss, for interpolating implicit 3D representations across time. Given implicit 3D representations at times $t_0$ and $t_1$, LIM produces a deformed shape at any continuous time $t\in[t_0,t_1]$, delivering high-quality interpolated frames in seconds. Furthermore, LIM enables explicit mesh tracking across time, producing a consistently UV-textured mesh sequence ready for integration into existing production pipelines. We also use LIM, in conjunction with a diffusion-based multiview generator, to produce dynamic 4D reconstructions from monocular videos. We evaluate LIM on various dynamic datasets, benchmarking against image-space interpolation methods (e.g., FiLM) and direct triplane linear interpolation, and demonstrate clear advantages. In summary, LIM is the first feed-forward model capable of high-speed tracked 4D asset reconstruction across diverse categories.
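To make the interpolation setting concrete, below is a minimal sketch of the "direct triplane linear interpolation" baseline mentioned above, contrasted with a hypothetical LIM-style feed-forward call. The tensor shapes and the `lim_model` name are illustrative assumptions for exposition, not the paper's actual API.

```python
import torch


def lerp_triplane(tri_t0: torch.Tensor, tri_t1: torch.Tensor, t: float) -> torch.Tensor:
    """Baseline: linearly blend two triplane feature grids.

    tri_t0, tri_t1: (3, C, H, W) triplane features at times t0 and t1
                    (shape is an illustrative assumption).
    t: normalized query time in [0, 1].
    """
    return (1.0 - t) * tri_t0 + t * tri_t1


# Hypothetical LIM-style usage: instead of blending features directly, a
# feed-forward transformer predicts the deformed representation conditioned
# on both endpoint triplanes and the query time.
# tri_t = lim_model(tri_t0, tri_t1, t)  # name and signature are assumptions
```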