Honeyfiles are a type of honeypot: security assets that mimic real, sensitive documents to attract and detect intruders on compromised systems by creating the illusion of valuable data. Interaction with a honeyfile reveals the presence of an intruder and can provide insights into their goals and intentions. Their practical use, however, is limited by the time, cost and effort of manually creating realistic content. Large language models have made high-quality text generation accessible, but honeyfiles contain a variety of content beyond text, including charts, tables and images. To successfully deceive an intruder, this content needs to be plausible and realistic, as well as semantically consistent both within a honeyfile and with the real documents it mimics. In this paper, we focus on an important component of the honeyfile content generation problem: document charts. Charts are ubiquitous in corporate documents and are commonly used to communicate quantitative and scientific data. Existing image generation models, such as DALL-E, are prone to generating charts with incomprehensible text and unconvincing data. We take a multi-modal approach to this problem by combining two purpose-built generative models: a multitask Transformer and a specialized multi-head autoencoder. The Transformer generates realistic captions and plot text, while the autoencoder generates the underlying tabular data for the plot. To advance the field of automated honeyplot generation, we also release a new document-chart dataset and propose a novel metric, Keyword Semantic Matching (KSM), which measures the semantic consistency between the keywords of a corpus and a smaller bag of words. Extensive experiments demonstrate excellent performance against multiple large language models, including ChatGPT and GPT-4.