As the volume, volatility, and diversity of data grow in virtually all areas of our lives, the ability to detect duplicates in potentially linked data sources is more important than ever. However, while researchers are actively adapting duplicate detection algorithms to these changing circumstances, existing test data generators are still designed for small, mostly relational datasets and can therefore fulfill their intended task only to a limited extent. In this report, we present our ongoing research on a novel approach to test data generation that, in contrast to existing solutions, can produce large test datasets with complex schemas and more realistic error patterns while remaining easy to use for inexperienced users.