We present a method for learning multiple scene representations from a small labeled set by exploiting the relationships between these representations, which we model as a multi-task hypergraph. We also show how the hypergraph can be used to improve a powerful pretrained VisTransformer model without any additional labeled data. In our hypergraph, each node is an interpretation layer of the scene (e.g., depth or segmentation). Within each hyperedge, one or several input nodes predict the layer at the output node, so a node can be an input node in some hyperedges and an output node in others. Multiple paths can therefore reach the same node, forming ensembles from which we obtain robust pseudolabels; these pseudolabels enable self-supervised learning in the hypergraph. We test different ensemble models and different types of hyperedges and show performance superior to other multi-task graph models in the field. We also introduce Dronescapes, a large video dataset captured with UAVs in different complex real-world scenes, with multiple representations, suitable for multi-task learning.
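To make the ensemble idea concrete, the following is a minimal sketch of how several hyperedges that share an output node could be combined into a pseudolabel. It is an illustration under stated assumptions, not the implementation used in this work: the names `Hyperedge` and `ensemble_pseudolabel` are hypothetical, the toy lambda predictors stand in for trained networks, layers are flattened to short lists of per-pixel values, and the per-pixel median is only one of many possible ensemble rules.

```python
from statistics import median
from typing import Callable, Dict, List, Tuple

# A layer is a flattened per-pixel map (assumption: real layers are images).
Layer = List[float]

# Hypothetical hyperedge: (input node names, output node name, predictor).
Hyperedge = Tuple[Tuple[str, ...], str, Callable[..., Layer]]


def ensemble_pseudolabel(
    node: str, hyperedges: List[Hyperedge], layers: Dict[str, Layer]
) -> Layer:
    """Combine the predictions of every hyperedge whose output is `node`."""
    candidates = []
    for inputs, output, predict in hyperedges:
        # Use only hyperedges that end in `node` and whose inputs are available.
        if output == node and all(name in layers for name in inputs):
            candidates.append(predict(*(layers[name] for name in inputs)))
    if not candidates:
        raise ValueError(f"no hyperedge reaches node {node!r}")
    # Assumed ensemble rule: per-pixel median over all candidate predictions.
    return [median(values) for values in zip(*candidates)]


# Toy scene with two known layers; the values are illustrative only.
layers = {"rgb": [0.25, 0.5, 0.75], "semantic": [1.0, 0.0, 1.0]}

# Three hyperedges predict "depth": two with one input node, one with two.
hyperedges: List[Hyperedge] = [
    (("rgb",), "depth", lambda rgb: [2 * v for v in rgb]),
    (("semantic",), "depth", lambda sem: [v + 0.5 for v in sem]),
    (("rgb", "semantic"), "depth", lambda rgb, sem: [a + b for a, b in zip(rgb, sem)]),
]

pseudo_depth = ensemble_pseudolabel("depth", hyperedges, layers)
print(pseudo_depth)  # [1.25, 0.5, 1.5]
```

In a self-supervised setting, a pseudolabel such as `pseudo_depth` would serve as the training target for the hyperedges ending in the `depth` node on unlabeled frames. The median is chosen here only because it tolerates a single outlying prediction; a mean or a learned combination would fit the same interface.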