Classification with Costly Features (CwCF) is a classification problem that includes the cost of features in the optimization criterion. For each sample individually, features are acquired sequentially to maximize accuracy while minimizing the cost of the acquired features. However, existing approaches can only process data that can be expressed as fixed-length vectors. In practice, data often has a rich and complex structure that is more precisely described with formats such as XML or JSON. Such data is hierarchical and often contains nested lists of objects. In this work, we extend an existing deep reinforcement learning-based algorithm with hierarchical deep sets and hierarchical softmax, so that it can process this data directly. The extended method has greater control over which features it can acquire and, in experiments with seven datasets, we show that this leads to superior performance. To demonstrate the practical use of the new method, we apply it to a real-life problem of classifying malicious web domains, using an online service.
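To illustrate the idea of a hierarchical softmax over nested data, the following is a minimal sketch and not the authors' implementation. It assumes each node of a JSON-like tree carries a selection score (hard-coded here; in a learned model such scores would presumably come from a network), applies a softmax among siblings at every level, and takes the probability of acquiring a leaf feature to be the product of the probabilities along its path. The names `softmax`, `leaf_probabilities`, and the example `tree` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def leaf_probabilities(node, prefix=(), prob=1.0):
    """Return {path: probability} for every leaf feature under `node`.

    `node` maps a child name to a pair (score, subtree), where `subtree`
    is None for a leaf feature and another such dict for a nested object.
    A softmax is applied among siblings at each level, and the
    probabilities are multiplied along the path from the root.
    """
    names = list(node)
    probs = softmax([node[name][0] for name in names])
    result = {}
    for name, p in zip(names, probs):
        _, subtree = node[name]
        path = prefix + (name,)
        if subtree is None:
            result[path] = prob * p
        else:
            result.update(leaf_probabilities(subtree, path, prob * p))
    return result


# Hypothetical sample: one scalar feature and a nested list of two objects.
# The scores are made up for illustration only.
tree = {
    "title": (0.5, None),
    "ip_addresses": (1.0, {"0": (0.2, None), "1": (-0.3, None)}),
}

probs = leaf_probabilities(tree)
for path, p in probs.items():
    print("/".join(path), round(p, 4))
```

Because every level is normalized separately, the leaf probabilities sum to one regardless of how many nested objects a sample contains, which is what allows a single policy to choose among a variable number of features.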