Clickbait headlines degrade the quality of online information and undermine user trust. We present a hybrid approach to clickbait detection that combines transformer-based text embeddings with linguistically motivated informativeness features. Using natural language processing techniques, we evaluate classical vectorizers, word embedding baselines, and large language model embeddings paired with tree-based classifiers. Our best-performing model, XGBoost over embeddings augmented with 15 explicit features, achieves an F1-score of 91\%, outperforming TF-IDF, Word2Vec, GloVe, LLM prompt-based classification, and feature-only baselines. The proposed feature set enhances interpretability by highlighting salient linguistic cues such as second-person pronouns, superlatives, numerals, and attention-oriented punctuation, enabling transparent and well-calibrated clickbait predictions. We release code and trained models to support reproducible research.
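To illustrate the hybrid representation, the following minimal sketch computes four of the cue types named above (second-person pronouns, superlatives, numerals, and attention-oriented punctuation) and appends them to an embedding vector. It is not the released implementation: the lexicons, the feature definitions, and the names `cue_features` and `augment` are assumptions made for illustration, the full set of 15 features is not reproduced here, and the embedding is a placeholder list rather than the output of a transformer model.

```python
import re
from typing import Dict, List

# Hypothetical cue lexicons (assumptions for illustration only).
SECOND_PERSON = {"you", "your", "yours", "yourself", "yourselves"}
SUPERLATIVES = {"best", "worst", "most", "least", "greatest", "biggest"}

# Fixed feature order so every augmented vector has the same layout.
FEATURE_ORDER = ["second_person", "superlatives", "numerals", "attention_punct"]


def cue_features(headline: str) -> Dict[str, float]:
    """Count simple surface cues in a headline (raw counts, not normalized)."""
    # Lowercased word tokens (apostrophes kept) and digit runs.
    tokens = re.findall(r"[a-z']+|\d+", headline.lower())
    return {
        "second_person": float(sum(t in SECOND_PERSON for t in tokens)),
        "superlatives": float(sum(t in SUPERLATIVES for t in tokens)),
        "numerals": float(sum(t.isdigit() for t in tokens)),
        # Attention-oriented punctuation: exclamation and question marks.
        "attention_punct": float(headline.count("!") + headline.count("?")),
    }


def augment(embedding: List[float], headline: str) -> List[float]:
    """Concatenate an embedding with the explicit cue features."""
    feats = cue_features(headline)
    return list(embedding) + [feats[name] for name in FEATURE_ORDER]


if __name__ == "__main__":
    headline = "You won't believe the 10 best tricks!"
    placeholder_embedding = [0.1, 0.2]  # stand-in for a real text embedding
    print(cue_features(headline))
    print(augment(placeholder_embedding, headline))
```

The augmented vector would then be passed to a tree-based classifier such as XGBoost. Because the cue features occupy fixed, named positions, their importance scores can be read off directly, which is the interpretability benefit described above.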