This paper revisits the problem of predicting box locations in object detection architectures. Typically, each box proposal or box query aims to directly maximize the intersection-over-union score with the ground truth, followed by a winner-takes-all non-maximum suppression where only the highest scoring box in each region is retained. We observe that both steps are sub-optimal: the first involves regressing proposals to the entire ground truth, which is a difficult task even with large receptive fields, and the second neglects valuable information from boxes other than the top candidate. Instead of regressing proposals to the whole ground truth, we propose a simpler approach: regress only to the area of intersection between the proposal and the ground truth. This avoids the need for proposals to extrapolate beyond their visual scope, improving localization accuracy. Rather than adopting a winner-takes-all strategy, we take the union over the regressed intersections of all boxes in a region to generate the final box outputs. Our plug-and-play method integrates seamlessly into proposal-based, grid-based, and query-based detection architectures with minimal modifications, consistently improving object localization and instance segmentation. We demonstrate its broad applicability and versatility across various detection and segmentation tasks.
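The two ideas above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `intersection_target` and `union_of_boxes` are hypothetical, boxes are assumed to be axis-aligned `(x1, y1, x2, y2)` tuples, and the "union" is assumed to mean the tightest axis-aligned box enclosing all regressed intersections. For clarity, the example treats the regression as perfect, so each proposal's output equals its intersection with the ground truth.

```python
from typing import List, Optional, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def intersection_target(proposal: Box, gt: Box) -> Optional[Box]:
    """Regression target for one proposal: its overlap with the ground truth.

    Returns None when the proposal does not overlap the ground truth.
    """
    x1 = max(proposal[0], gt[0])
    y1 = max(proposal[1], gt[1])
    x2 = min(proposal[2], gt[2])
    y2 = min(proposal[3], gt[3])
    if x2 <= x1 or y2 <= y1:
        return None
    return (x1, y1, x2, y2)


def union_of_boxes(boxes: List[Box]) -> Box:
    """Merge boxes into the tightest axis-aligned box enclosing all of them.

    This enclosing-box reading of "union" is an assumption of this sketch.
    """
    if not boxes:
        raise ValueError("need at least one box to take a union")
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


# Illustrative values only: one ground-truth box and three overlapping proposals.
gt: Box = (10.0, 10.0, 50.0, 50.0)
proposals: List[Box] = [
    (0.0, 0.0, 30.0, 30.0),
    (25.0, 5.0, 60.0, 40.0),
    (20.0, 30.0, 45.0, 70.0),
]

# Step 1: each proposal is paired with its intersection with the ground truth.
targets = [t for t in (intersection_target(p, gt) for p in proposals) if t is not None]

# Step 2: the final box is the union over all (here, ideally regressed) intersections.
merged = union_of_boxes(targets)
```

In this toy case no single proposal covers the ground truth, yet the union of their intersections spans it completely, which is the intuition behind replacing the winner-takes-all step with a union over all boxes in a region.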