The likelihood ratio framework is widely recognized as the logically and legally sound basis for evidential analysis across the forensic sciences, and its importance is increasingly acknowledged in the analysis of authorship in textual evidence. To date, however, its application has been confined to English-language texts. Meanwhile, authorship attribution has traditionally relied on a diverse array of stylometric features, even as the rise of pre-trained large language models enables new contextual-embedding approaches. Combining these approaches through fusion promises enhanced performance, yet fusion has not been used to integrate stylometric-feature systems with embedding-based systems within the likelihood ratio paradigm. This study is the first to apply likelihood ratio-based forensic text comparison to Japanese digital texts, using excerpts of approximately 1,000 characters from blogs, to 1) evaluate system performance and likelihood ratio magnitudes and 2) assess the impact of fusing stylometric-feature systems with embedding-based systems. The results demonstrate that the fused system maintains excellent calibration while 1) increasing consistent-with-fact likelihood ratio magnitudes; 2) decreasing contrary-to-fact likelihood ratio magnitudes; and 3) improving overall discriminability. The best-performing fusion achieved a log-likelihood-ratio cost of 0.32484, illustrating both the feasibility of the likelihood ratio framework for Japanese and the benefits of fusion across heterogeneous systems.
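For readers unfamiliar with the log-likelihood-ratio cost, the following is a minimal sketch of how the metric is commonly computed from a set of likelihood ratios, assuming the standard definition that averages log2(1 + 1/LR) over same-author comparisons and log2(1 + LR) over different-author comparisons. The function name `cllr` and the toy likelihood ratios are illustrative assumptions, not the study's code or data. Under this definition, lower values indicate better performance, and a system that always outputs a likelihood ratio of 1 scores exactly 1.

```python
import math


def cllr(same_author_lrs, diff_author_lrs):
    """Log-likelihood-ratio cost under the standard definition (assumed here).

    same_author_lrs: likelihood ratios from same-author comparisons
    diff_author_lrs: likelihood ratios from different-author comparisons
    Both inputs are hypothetical names; LRs must be positive (not log-LRs).
    """
    if not same_author_lrs or not diff_author_lrs:
        raise ValueError("both sets of likelihood ratios must be non-empty")

    # Penalty for same-author comparisons: grows as the LR falls below 1.
    same_term = sum(math.log2(1.0 + 1.0 / lr) for lr in same_author_lrs)
    same_term /= len(same_author_lrs)

    # Penalty for different-author comparisons: grows as the LR rises above 1.
    diff_term = sum(math.log2(1.0 + lr) for lr in diff_author_lrs)
    diff_term /= len(diff_author_lrs)

    return 0.5 * (same_term + diff_term)


# An uninformative system (every LR equal to 1) has a cost of exactly 1.
uninformative = cllr([1.0, 1.0], [1.0, 1.0])

# Toy values only: LRs that point the right way yield a cost well below 1.
informative = cllr([10.0, 100.0], [0.1, 0.01])

print(uninformative, informative)
```

Because contrary-to-fact likelihood ratios of large magnitude are penalized heavily, the cost reflects calibration as well as discriminability, which is why a single value such as the one reported above can summarize both.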