Learning-driven Scene Graph Generation (SGG) models perform well on frequent relation types but degrade sharply under annotation sparsity, failing to capture reliable visual commonsense knowledge. We propose a model-agnostic, semantically guided knowledge refinement framework that systematically mines commonsense-grounded constraints from training data, capturing spatial, functional, and qualitative relational regularities, and uses general declarative commonsense reasoning to correct and refine ranked SGG predictions at inference time. The framework requires no manual rule authoring and no model retraining, and it transfers across datasets and architectures. On three standard benchmarks, it yields consistent improvements over strong baselines, demonstrating that structured visual commonsense reasoning over deep scene semantics is a practical and effective complement to purely learning-based scene graph generation.
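To make the mine-then-refine idea concrete, the following is a minimal toy sketch, not the framework's actual method. It assumes constraints can be approximated by the set of predicates observed in training for each (subject class, object class) pair, and it stands in for declarative reasoning with a simple score penalty. The function names `mine_constraints` and `refine`, the `min_support` and `penalty` parameters, and the example triples are all hypothetical.

```python
from collections import defaultdict


def mine_constraints(training_triples, min_support=1):
    """Collect, per (subject, object) class pair, the predicates seen in training.

    Assumption: a predicate counts as plausible for a pair if it occurs
    at least `min_support` times in the training annotations.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for subj, pred, obj in training_triples:
        counts[(subj, obj)][pred] += 1
    return {
        pair: {p for p, c in preds.items() if c >= min_support}
        for pair, preds in counts.items()
    }


def refine(ranked_predictions, constraints, penalty=0.5):
    """Re-rank (subject, predicate, object, score) predictions at inference time.

    Predictions that violate a mined constraint have their score multiplied
    by `penalty`; the underlying SGG model is never retrained.
    """
    refined = []
    for subj, pred, obj, score in ranked_predictions:
        allowed = constraints.get((subj, obj))
        # Pairs never seen in training carry no constraint, so their
        # scores are left unchanged.
        if allowed is not None and pred not in allowed:
            score *= penalty
        refined.append((subj, pred, obj, score))
    return sorted(refined, key=lambda t: t[3], reverse=True)


# Hypothetical training annotations and model outputs.
train = [
    ("person", "riding", "horse"),
    ("person", "riding", "horse"),
    ("person", "near", "horse"),
    ("cup", "on", "table"),
]
constraints = mine_constraints(train)

predictions = [
    ("person", "wearing", "horse", 0.6),  # implausible for this pair
    ("person", "riding", "horse", 0.5),
    ("dog", "on", "sofa", 0.4),           # unseen pair, left untouched
]
refined = refine(predictions, constraints)
```

In this sketch the implausible "person wearing horse" triple drops below the other two candidates, while the unseen pair keeps its original score. A real system would need richer spatial, functional, and qualitative constraints than simple co-occurrence sets.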