Probabilistic Structured Queries (PSQ) is a cross-language information retrieval (CLIR) method that uses translation probabilities statistically derived from aligned corpora. PSQ is a strong baseline for efficient CLIR using sparse indexing. It is therefore useful as the first stage of a cascaded neural CLIR system whose second stage is more effective but too inefficient to search a large text collection on its own. In this reproducibility study, we revisit PSQ by introducing an efficient Python implementation. Unconstrained use of all translation probabilities that can be estimated from aligned parallel text would, in the limit, assign a weight to every vocabulary term, precluding the use of an inverted index to serve queries efficiently. Thus, PSQ's effectiveness and efficiency both depend on how translation probabilities are pruned. This paper presents experiments over a range of modern CLIR test collections to demonstrate that achieving Pareto-optimal PSQ effectiveness-efficiency tradeoffs benefits from multi-criteria pruning, which has not been fully explored in prior work. Our Python PSQ implementation is available on GitHub (https://github.com/hltcoe/PSQ), and unpruned translation tables are available on Hugging Face Models (https://huggingface.co/hltcoe/psq_translation_tables).
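To make the idea of multi-criteria pruning concrete, the following minimal Python sketch prunes the translation distribution of a single source term using three criteria at once: a minimum probability, a cap on cumulative probability mass, and a maximum number of translations. It is an illustration only and is not taken from the released PSQ implementation. The function name `prune_translations`, its parameters, the choice of these three criteria, the renormalization step, and the example probabilities are all assumptions made for this sketch.

```python
from typing import Dict, Optional


def prune_translations(
    probs: Dict[str, float],
    min_prob: float = 0.0,
    max_cum_prob: float = 1.0,
    top_k: Optional[int] = None,
) -> Dict[str, float]:
    """Prune one source term's translation distribution (hypothetical helper).

    A translation is kept only while all three criteria hold:
      * its probability is at least `min_prob`,
      * the mass accumulated by higher-ranked translations is below `max_cum_prob`,
      * fewer than `top_k` translations have been kept (if `top_k` is set).
    """
    # Rank by descending probability; break ties by term for determinism.
    ranked = sorted(probs.items(), key=lambda kv: (-kv[1], kv[0]))

    kept: Dict[str, float] = {}
    cumulative = 0.0
    for term, p in ranked:
        if p < min_prob:
            break  # every remaining translation is even less probable
        if top_k is not None and len(kept) >= top_k:
            break
        if cumulative >= max_cum_prob:
            break
        kept[term] = p
        cumulative += p

    # Assumption: renormalize the surviving probabilities so they sum to 1.
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()} if total > 0 else {}


# Made-up probabilities for a single English query term (illustrative only).
example = {
    "maison": 0.60,
    "domicile": 0.25,
    "foyer": 0.10,
    "logement": 0.04,
    "chez": 0.01,
}

pruned = prune_translations(example, min_prob=0.05, max_cum_prob=0.9, top_k=3)
print(pruned)
```

Tightening any one criterion shortens the list of surviving translations, which in turn reduces the number of postings lists consulted per query term. Combining the criteria gives finer control over that tradeoff than any single threshold would.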