We present a method for learning multiple scene representations from a small labeled set by exploiting the relationships between these representations, expressed as a multi-task hypergraph. We also show how the hypergraph can be used to improve a powerful pretrained VisTransformer model without any additional labeled data. In our hypergraph, each node is an interpretation layer of the scene (e.g., depth or segmentation). Within each hyperedge, one or several input nodes predict the layer at the output node, so each node can be an input node in some hyperedges and an output node in others. Multiple paths can therefore reach the same node and form ensembles, from which we obtain robust pseudolabels that enable self-supervised learning in the hypergraph. We test different ensemble models and different types of hyperedges and show superior performance compared with other multi-task graph models in the field. We also introduce Dronescapes, a large video dataset captured with UAVs in different complex real-world scenes, with multiple representations, suitable for multi-task learning.
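The ensemble step described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the per-pixel median stands in for the ensemble model (the abstract only says several are tested), the hyperedge predictors are toy arithmetic functions where the paper uses learned networks, and the names `ensemble_pseudolabel`, `layers`, and `hyperedges` are hypothetical.

```python
from statistics import median
from typing import Callable, Dict, List, Tuple

# A layer is a flattened per-pixel map (e.g. depth); hypothetical toy type.
Layer = List[float]

# A hyperedge: (input node names, output node name, predictor function).
Hyperedge = Tuple[Tuple[str, ...], str, Callable[..., Layer]]


def ensemble_pseudolabel(
    node: str, layers: Dict[str, Layer], hyperedges: List[Hyperedge]
) -> Layer:
    """Combine every hyperedge that outputs `node` into one pseudolabel.

    Assumption: a per-pixel median is used as the ensemble function.
    """
    candidates = [
        predictor(*(layers[name] for name in inputs))
        for inputs, output, predictor in hyperedges
        if output == node and all(name in layers for name in inputs)
    ]
    if not candidates:
        raise ValueError(f"no usable hyperedge reaches node {node!r}")
    # zip(*candidates) groups the predictions for each pixel together.
    return [median(pixel_values) for pixel_values in zip(*candidates)]


# Toy input layers with three "pixels" each.
layers: Dict[str, Layer] = {
    "rgb": [1.0, 2.0, 3.0],
    "edges": [0.0, 4.0, 2.0],
}

# Three hyperedges that all predict the "depth" node: two with a single
# input node and one with two input nodes. The predictors are stand-ins
# for trained networks.
hyperedges: List[Hyperedge] = [
    (("rgb",), "depth", lambda rgb: [2.0 * v for v in rgb]),
    (("edges",), "depth", lambda edges: [v + 1.0 for v in edges]),
    (("rgb", "edges"), "depth",
     lambda rgb, edges: [a + b for a, b in zip(rgb, edges)]),
]

pseudo_depth = ensemble_pseudolabel("depth", layers, hyperedges)
print(pseudo_depth)  # [1.0, 5.0, 5.0]
```

In a self-supervised setting, a pseudolabel such as `pseudo_depth` would serve as the training target for the hyperedges on unlabeled frames. The median is chosen here only because it tolerates a single outlying path.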