We propose a novel supervised learning approach to political ideology prediction (PIP) that can make predictions on out-of-distribution inputs. The problem is motivated by the fact that manual data labeling is expensive, while self-reported labels are often scarce and exhibit significant selection bias. Our statistical model decomposes each document embedding into a linear superposition of two vectors: a latent neutral \emph{context} vector that is independent of ideology, and a latent \emph{position} vector that is aligned with ideology. We train an end-to-end model that produces the context and position vectors as intermediate outputs. At deployment time, the model predicts labels for input documents using only the predicted position vectors. On two benchmark datasets, we show that our model can output predictions even when trained with as little as 5\% biased data, and that it is significantly more accurate than the state of the art. Through crowd-sourcing, we validate the neutrality of the context vectors and show that context filtering results in ideological concentration, which allows for prediction on out-of-distribution examples.
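
As a minimal sketch of the decomposition, in notation that is assumed here for illustration and not taken from the paper: let $\mathbf{x}$ be a document embedding, $\mathbf{c}$ the latent context vector, $\mathbf{p}$ the latent position vector, and $g$ a hypothetical classifier that maps a position vector to an ideology label. Taking unit weights in the superposition (also an assumption), the model structure might be written as
\begin{equation}
  % Illustrative notation only: x, c, p, g and the unit weights are assumptions.
  \mathbf{x} = \mathbf{c} + \mathbf{p}, \qquad \hat{y} = g(\mathbf{p}),
\end{equation}
so that the predicted label $\hat{y}$ depends on the input only through $\mathbf{p}$, and the ideology-independent component $\mathbf{c}$ is filtered out at deployment time.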