Misinformation remains a major societal problem, and the arrival of Large Language Models (LLMs) has only compounded it. This paper analyzes synthetic, false, and genuine textual information from spectral analysis, visualization, and explainability perspectives to determine why the problem remains unsolved despite years of research and a plethora of proposed solutions in the literature. Information is represented using various embedding techniques across multiple datasets. The spectral and non-spectral methods applied to these embeddings include t-distributed Stochastic Neighbor Embedding (t-SNE), Principal Component Analysis (PCA), and Variational Autoencoders (VAEs). Classification is performed with multiple machine learning algorithms, and Local Interpretable Model-Agnostic Explanations (LIME), SHapley Additive exPlanations (SHAP), and Integrated Gradients are used to explain the classifications. The analysis and the generated explanations show that misinformation is closely intertwined with genuine information and that machine learning algorithms are not as effective at separating the two as the literature claims.
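To make the idea of "intertwined" classes concrete, the following is a minimal sketch, not the paper's pipeline. It assumes synthetic Gaussian vectors as stand-ins for text embeddings, estimates the leading PCA direction by power iteration, and scores a simple midpoint threshold along that direction in place of the machine learning classifiers the paper uses. All function names and parameters (`make_embeddings`, `shift`, `separation`, and so on) are hypothetical and chosen only for illustration.

```python
import random
import statistics


def make_embeddings(n, dim, shift, seed):
    """Draw n synthetic 'embedding' vectors; every coordinate is N(shift, 1)."""
    rng = random.Random(seed)
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]


def top_principal_component(rows, iters=100):
    """Estimate the leading PCA direction by power iteration."""
    dim = len(rows[0])
    mean = [statistics.fmean(col) for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, mean)] for row in rows]
    v = [1.0 / dim ** 0.5] * dim  # unit-length starting vector
    for _ in range(iters):
        # Apply X^T X to v without forming the covariance matrix explicitly.
        scores = [sum(a * b for a, b in zip(row, v)) for row in centered]
        w = [sum(s * row[j] for s, row in zip(scores, centered))
             for j in range(dim)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return mean, v


def project(rows, mean, v):
    """Project each row onto direction v after centering."""
    return [sum((x - m) * c for x, m, c in zip(row, mean, v)) for row in rows]


def midpoint_accuracy(scores_a, scores_b):
    """Accuracy of a threshold placed midway between the two class means."""
    mu_a, mu_b = statistics.fmean(scores_a), statistics.fmean(scores_b)
    mid = (mu_a + mu_b) / 2
    sign = 1 if mu_a > mu_b else -1
    correct = sum(sign * (s - mid) > 0 for s in scores_a)
    correct += sum(sign * (s - mid) <= 0 for s in scores_b)
    return correct / (len(scores_a) + len(scores_b))


def separation(shift, n=200, dim=8):
    """Accuracy along the first principal component for a given class shift."""
    genuine = make_embeddings(n, dim, 0.0, seed=0)      # assumed "genuine" class
    false = make_embeddings(n, dim, shift, seed=1)      # assumed "false" class
    mean, v = top_principal_component(genuine + false)
    return midpoint_accuracy(project(genuine, mean, v),
                             project(false, mean, v))


well_separated = separation(shift=3.0)   # class means far apart
intertwined = separation(shift=0.05)     # class means nearly coincide
print(f"well separated: {well_separated:.2f}, intertwined: {intertwined:.2f}")
```

In this toy setting, accuracy along the first principal component is high when the class means are far apart and falls toward chance as they converge. This mirrors, only qualitatively, the kind of overlap the abstract reports between misinformation and genuine information. It says nothing about the datasets, embeddings, or classifiers actually evaluated in the paper.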