MultiSChuBERT: Effective Multimodal Fusion for Scholarly Document Quality Prediction

Automatic assessment of the quality of scholarly documents is a difficult task with high potential impact. Multimodality, in particular the addition of visual information alongside text, has been shown to improve performance on scholarly document quality prediction (SDQP) tasks. We propose the multimodal predictive model MultiSChuBERT. It combines a textual model based on chunking the full paper text and aggregating the computed BERT chunk encodings (SChuBERT) with a visual model based on Inception V3. Our work contributes to the current state of the art in SDQP in three ways. First, we show that the method of combining visual and textual embeddings can substantially influence the results. Second, we demonstrate that gradual unfreezing of the weights of the visual sub-model reduces its tendency to overfit the data, improving results. Third, we show that the benefit of multimodality is retained when replacing standard BERT$_{\textrm{BASE}}$ embeddings with more recent state-of-the-art text embedding models. Using BERT$_{\textrm{BASE}}$ embeddings, on the (log) number of citations prediction task with the ACL-BiblioMetry dataset, our MultiSChuBERT (text+visual) model obtains an $R^{2}$ score of 0.454, compared to 0.432 for the SChuBERT (text-only) model. Similar improvements are obtained on the PeerRead accept/reject prediction task. In our experiments using SciBERT, scincl, SPECTER and SPECTER2.0 embeddings, we show that each of these tailored embeddings yields further improvements over the standard BERT$_{\textrm{BASE}}$ embeddings, with the SPECTER2.0 embeddings performing best.
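
To make the idea of gradual unfreezing concrete, the following is a minimal sketch of one possible schedule, not the schedule used in this work. It assumes the visual sub-model can be split into an ordered list of layer groups and that one further group is unfrozen, from the output end towards the input, every few epochs. The `LayerGroup` class, the group names, and the `epochs_per_stage` parameter are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LayerGroup:
    """Hypothetical stand-in for a block of layers in the visual sub-model."""
    name: str
    trainable: bool = False  # every group starts frozen


def apply_gradual_unfreezing(groups: List[LayerGroup], epoch: int,
                             epochs_per_stage: int = 2) -> List[str]:
    """Mark groups as trainable for the given epoch and return their names.

    Assumed schedule: stage 0 trains only the last group (closest to the
    output); each later stage unfreezes one more group towards the input.
    """
    if epochs_per_stage < 1:
        raise ValueError("epochs_per_stage must be at least 1")
    stage = epoch // epochs_per_stage
    n_unfrozen = min(len(groups), stage + 1)
    first_trainable = len(groups) - n_unfrozen
    for index, group in enumerate(groups):
        group.trainable = index >= first_trainable
    return [group.name for group in groups if group.trainable]


def make_visual_groups() -> List[LayerGroup]:
    """Illustrative grouping, ordered from input to output (names are made up)."""
    return [LayerGroup(n) for n in ("stem", "mid_blocks", "late_blocks", "head")]


if __name__ == "__main__":
    groups = make_visual_groups()
    for epoch in range(8):
        print(epoch, apply_gradual_unfreezing(groups, epoch))
```

In a deep learning framework, the `trainable` flag would correspond to whatever mechanism disables gradient updates for a set of parameters. Keeping the early layers frozen for longer limits how many weights can adapt to a small training set at once, which is the intuition behind the reduced overfitting described above.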