Cubify Anything: Scaling Indoor 3D Object Detection

We consider indoor 3D object detection from a single RGB(-D) frame acquired with a commodity handheld device. We seek to significantly advance the status quo in both data and modeling. First, we establish that existing datasets have significant limitations in scale, accuracy, and diversity of objects. In response, we introduce the Cubify-Anything 1M (CA-1M) dataset, which exhaustively labels over 400K 3D objects in over 1K highly accurate laser-scanned scenes with near-perfect registration to over 3.5K handheld, egocentric captures. Next, we establish Cubify Transformer (CuTR), a fully Transformer-based 3D object detection baseline which, rather than operating in 3D on point- or voxel-based representations, predicts 3D boxes directly from 2D features derived from RGB(-D) inputs. While this approach lacks any 3D inductive biases, we show that, paired with CA-1M, CuTR outperforms point-based methods, accurately recalling over 62% of objects in 3D, and is significantly more capable of handling the noise and uncertainty present in commodity LiDAR-derived depth maps, while also providing promising RGB-only performance without architecture changes. Furthermore, by pre-training on CA-1M, CuTR can outperform point-based methods on a more diverse variant of SUN RGB-D, supporting the notion that while inductive biases in 3D are useful at the smaller sizes of existing datasets, they fail to scale to the data-rich regime of CA-1M. Overall, this dataset and baseline model provide strong evidence that we are moving towards models which can effectively Cubify Anything.
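
To make the "recalling objects in 3D" figure concrete, the following is a minimal sketch of how recall at a 3D IoU threshold could be computed. It is not the paper's evaluation protocol: it assumes axis-aligned boxes (real 3D boxes are usually oriented), a hypothetical threshold of 0.25, and simple greedy one-to-one matching. The names `iou_3d` and `recall_at_iou` are illustrative only.

```python
from typing import List, Tuple

# Assumed box format: (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned.
Box = Tuple[float, float, float, float, float, float]


def volume(box: Box) -> float:
    """Volume of an axis-aligned box."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    return inter / (volume(a) + volume(b) - inter)


def recall_at_iou(gt: List[Box], pred: List[Box], threshold: float = 0.25) -> float:
    """Fraction of ground-truth boxes matched by a distinct prediction.

    Greedy matching: each ground-truth box takes the unused prediction
    with the highest IoU, provided that IoU reaches the threshold.
    """
    if not gt:
        return 0.0
    used = set()
    hits = 0
    for g in gt:
        best_j, best_iou = -1, threshold
        for j, p in enumerate(pred):
            if j in used:
                continue
            value = iou_3d(g, p)
            if value >= best_iou:
                best_j, best_iou = j, value
        if best_j >= 0:
            used.add(best_j)
            hits += 1
    return hits / len(gt)
```

Under this sketch, a recall of 0.62 would mean that 62% of the labeled boxes have a sufficiently overlapping prediction. The threshold and matching rule actually used for the reported number are not stated in the abstract.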
