We present an approach to estimating camera rotation in crowded, real-world scenes from handheld monocular video. While camera rotation estimation is a well-studied problem, no previous method exhibits both high accuracy and acceptable speed in this setting. Because existing datasets do not address this setting well, we provide a new dataset and benchmark, with high-accuracy, rigorously verified ground truth, on 17 video sequences. Methods developed for wide baseline stereo (e.g., 5-point methods) perform poorly on monocular video. On the other hand, methods used in autonomous driving (e.g., SLAM) leverage specific sensor setups, specific motion models, or local optimization strategies (lagging batch processing) and do not generalize well to handheld video. Finally, for dynamic scenes, commonly used robustification techniques like RANSAC require large numbers of iterations and become prohibitively slow. We introduce a novel generalization of the Hough transform on SO(3) to efficiently and robustly find the camera rotation most compatible with optical flow. Among comparably fast methods, ours reduces error by almost 50% over the next best, and it is more accurate than any other method, irrespective of speed. This represents a strong new performance point for crowded scenes, an important setting for computer vision. The code and the dataset are available at https://fabiendelattre.com/robust-rotation-estimation.
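To make the voting idea concrete, the following is a minimal sketch of Hough-style voting over rotations. It is not the authors' implementation. It assumes the standard small-rotation (instantaneous) flow model in normalized image coordinates with unit focal length, under which each flow vector is compatible with a one-dimensional line of rotation vectors. The function names `rotational_flow` and `hough_rotation`, and the parameters `w_max`, `bins`, and `samples`, are hypothetical and chosen only for illustration.

```python
import numpy as np

def rotational_flow(points, omega):
    """Flow induced by a small rotation omega = (wx, wy, wz) at normalized
    image points (x, y), assuming unit focal length (an assumption here)."""
    x, y = points[:, 0], points[:, 1]
    wx, wy, wz = omega
    u = x * y * wx - (1 + x**2) * wy + y * wz
    v = (1 + y**2) * wx - x * y * wy - x * wz
    return np.stack([u, v], axis=1)

def hough_rotation(points, flow, w_max=0.05, bins=41, samples=201):
    """Vote in a discretized cube [-w_max, w_max]^3 of rotation vectors and
    return the center of the bin with the most votes."""
    width = 2 * w_max / bins
    centers = -w_max + (np.arange(bins) + 0.5) * width
    acc = np.zeros((bins, bins, bins), dtype=np.int64)
    # Parameter along each line; +/- 2*w_max covers the whole cube.
    t = np.linspace(-2 * w_max, 2 * w_max, samples)
    for (x, y), f in zip(points, flow):
        # 2x3 matrix mapping a rotation vector to the flow at (x, y).
        A = np.array([[x * y, -(1 + x**2), y],
                      [1 + y**2, -x * y, -x]])
        w0 = np.linalg.pinv(A) @ f           # minimum-norm compatible rotation
        n = np.array([x, y, 1.0])            # null space of A: the viewing ray
        n /= np.linalg.norm(n)
        line = w0 + t[:, None] * n           # sampled line of compatible rotations
        idx = np.floor((line + w_max) / width).astype(int)
        ok = np.all((idx >= 0) & (idx < bins), axis=1)
        if not ok.any():
            continue
        idx = np.unique(idx[ok], axis=0)     # one vote per bin per flow vector
        acc[idx[:, 0], idx[:, 1], idx[:, 2]] += 1
    best = np.unravel_index(np.argmax(acc), acc.shape)
    return centers[list(best)]
```

Calling `hough_rotation(points, flow)` with an `(N, 2)` array of normalized image points and an `(N, 2)` array of flow vectors returns a length-3 rotation vector. Flow vectors on independently moving objects vote for unrelated lines, so they tend not to shift the winning bin, which is the property that makes voting attractive in crowded scenes. In this sketch the resolution is limited by the bin width, and the component about the optical axis is the least well constrained when the points span a narrow field of view.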