Dense retrievers compress source documents into (possibly lossy) vector representations, yet there is little analysis of which information is lost and which is preserved, and of how this affects downstream tasks. We conduct the first analysis of the information captured by dense retrievers relative to the language models they are based on (e.g., BERT versus Contriever). We use the 25 MultiBERT checkpoints as randomized initialisations to train MultiContrievers, a set of 25 Contriever models. We test whether specific pieces of information -- such as gender and occupation -- can be extracted from Contriever vectors of Wikipedia-like documents, measuring this extractability via information-theoretic probing. We then examine the relationship of extractability to performance and gender bias, as well as the sensitivity of these results to many random initialisations and data shuffles. We find that (1) Contriever models have significantly increased extractability, but extractability usually correlates poorly with benchmark performance; (2) gender bias is present, but is not caused by the Contriever representations; (3) there is high sensitivity to both random initialisation and data shuffle, suggesting that future retrieval research should test across a wider spread of both.
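To make the probing methodology concrete, the following is a minimal sketch of information-theoretic (online-coding / MDL) probing in the style used for measuring extractability. It is not the paper's actual implementation: the probe architecture, block fractions, and toy data are assumptions for illustration. The idea is that a label (e.g., gender) is "extractable" from embeddings to the extent that a probe trained online compresses the labels below the uniform-code baseline.

```python
# Hedged sketch of online-coding (prequential) MDL probing.
# Assumptions (not from the paper): a linear logistic probe, binary labels,
# and synthetic "embeddings" standing in for Contriever vectors.
import numpy as np

def train_logreg(X, y, lr=0.1, steps=500):
    """Tiny logistic-regression probe trained with full-batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def online_codelength(X, y, fractions=(0.1, 0.2, 0.4, 0.8, 1.0)):
    """Total description length (bits) of labels y under online coding.
    Lower codelength means the label is more extractable from X."""
    n = len(y)
    cuts = [int(f * n) for f in fractions]
    # First block is transmitted with a uniform code: 1 bit per binary label.
    total_bits = float(cuts[0])
    for start, end in zip(cuts[:-1], cuts[1:]):
        w, b = train_logreg(X[:start], y[:start])          # probe on data so far
        p = 1.0 / (1.0 + np.exp(-(X[start:end] @ w + b)))  # predict next block
        p = np.clip(p, 1e-12, 1 - 1e-12)
        # Bits to encode the next block's labels under the probe's distribution.
        total_bits += -np.sum(y[start:end] * np.log2(p)
                              + (1 - y[start:end]) * np.log2(1 - p))
    return total_bits

# Toy usage: random vectors whose first dimension partially leaks the label.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=400).astype(float)
X = rng.normal(size=(400, 16))
X[:, 0] += 2.0 * y                  # inject a linearly extractable signal
bits = online_codelength(X, y)
uniform_bits = 400.0                # baseline: 1 bit per label
compression = uniform_bits / bits   # > 1 means the probe extracts the label
```

Reporting compression (uniform codelength divided by probe codelength) rather than raw probe accuracy is what makes this an information-theoretic measure: it rewards probes that extract the label both correctly and with high confidence, and it is comparable across checkpoints.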