Efficient Hybrid CNN-GNN Architecture for Monocular Depth Estimation

We present GraphDepth, a monocular depth estimation architecture that integrates Graph Neural Networks (GNNs) into a convolutional encoder-decoder framework. Our approach embeds efficient GraphSAGE layers at multiple scales of a ResNet-101 U-Net backbone, enabling explicit modeling of long-range spatial relationships that lie beyond the receptive field of local convolutions. Key technical contributions include: (1) batch-parallelized graph construction with configurable k-NN and grid-based adjacency for scalable training; (2) multi-scale GraphSAGE integration at the bottleneck and decoder stages (1/32, 1/16, and 1/8 resolution) to propagate global context throughout the feature hierarchy; (3) channel-attention gated skip connections that adaptively weight encoder features before fusion; and (4) heteroscedastic (aleatoric) uncertainty estimation via a dedicated uncertainty head, enabling confidence-aware loss weighting during optimization. Unlike transformer-based hybrids, whose self-attention cost grows quadratically with sequence length, GraphDepth scales linearly with spatial resolution while achieving comparable global receptive fields through iterative message passing. Experiments on the NYU Depth V2, WHU Aerial, ETH3D, and Mid-Air benchmarks show accuracy within 4.6% of state-of-the-art transformers on indoor scenes at substantially lower computational cost (25 FPS vs. 9 FPS, 3.8 GB vs. 8.8 GB VRAM). GraphDepth achieves the best reported result on WHU Aerial (RMSE 8.24 m) and exhibits superior zero-shot cross-domain transfer to the Mid-Air synthetic aerial dataset, which supports the generalization benefit of explicit relational reasoning for depth estimation.
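
To make contributions (1), (2), and (4) more concrete, the following is a minimal pure-Python sketch and not the GraphDepth implementation. It builds a 4-connected grid adjacency over a feature map, applies one GraphSAGE-style mean-aggregation step, and evaluates a heteroscedastic regression loss. The names grid_neighbors, sage_mean_layer, and gaussian_nll are hypothetical, the weight matrices are passed in by hand rather than learned, and the loss uses the common Gaussian log-variance form, which the abstract does not specify and is assumed here purely for illustration.

```python
import math


def grid_neighbors(h, w):
    """4-connected grid adjacency for an h x w feature map.

    Node i corresponds to pixel (i // w, i % w). Each node has at most
    four neighbors, so the edge count grows linearly with h * w.
    """
    neighbors = [[] for _ in range(h * w)]
    for r in range(h):
        for c in range(w):
            i = r * w + c
            if r > 0:
                neighbors[i].append(i - w)  # up
            if r < h - 1:
                neighbors[i].append(i + w)  # down
            if c > 0:
                neighbors[i].append(i - 1)  # left
            if c < w - 1:
                neighbors[i].append(i + 1)  # right
    return neighbors


def _matvec(weight, x):
    """Multiply a (rows x cols) nested-list matrix by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in weight]


def sage_mean_layer(feats, neighbors, w_self, w_neigh):
    """One GraphSAGE-style step with mean aggregation.

    h_i' = ReLU(W_self h_i + W_neigh * mean_{j in N(i)} h_j)
    """
    dim = len(feats[0])
    out = []
    for i, h_i in enumerate(feats):
        nbrs = neighbors[i]
        if nbrs:
            mean = [sum(feats[j][d] for j in nbrs) / len(nbrs) for d in range(dim)]
        else:
            mean = [0.0] * dim
        z = [a + b for a, b in zip(_matvec(w_self, h_i), _matvec(w_neigh, mean))]
        out.append([max(0.0, v) for v in z])
    return out


def gaussian_nll(pred, target, log_var):
    """Heteroscedastic Gaussian negative log-likelihood (constant dropped).

    Per pixel: 0.5 * exp(-s) * (y - y_hat)^2 + 0.5 * s, with s = log sigma^2.
    A large predicted variance down-weights the residual but pays a 0.5 * s penalty.
    """
    total = 0.0
    for y_hat, y, s in zip(pred, target, log_var):
        total += 0.5 * math.exp(-s) * (y - y_hat) ** 2 + 0.5 * s
    return total / len(pred)


if __name__ == "__main__":
    nbrs = grid_neighbors(4, 4)
    # Start with a single active node in the top-left corner.
    feats = [[1.0]] + [[0.0] for _ in range(15)]
    for _ in range(6):
        feats = sage_mean_layer(feats, nbrs, [[1.0]], [[1.0]])
    print("far-corner activation after 6 steps:", feats[15][0])
    print("loss:", gaussian_nll([0.0], [2.0], [math.log(4.0)]))
```

In this sketch, information from one corner of a 4 x 4 grid needs six message-passing steps to reach the opposite corner, because each step extends the receptive field by one hop. This is why a pure grid graph reaches global context only through repeated layers or coarse resolutions, and it is one plausible motivation for also offering k-NN adjacency, where feature-space neighbors can link distant positions.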
