Reasonable pricing of data products enables data trading platforms to maximize revenue and fosters the growth of the data trading market. The textual semantics of data products carry information that is vital for pricing, yet their value remains largely underexplored. To investigate how textual features influence data product pricing, we encode the descriptive text of data products with five widely used text representation techniques. We then predict data product prices with six machine learning methods: linear regression, neural networks, decision trees, support vector machines, random forests, and XGBoost. Our empirical design consists of two tasks: a regression task that predicts the continuous price of a data product, and a classification task that predicts a price tier obtained by discretizing price into ordered categories. Furthermore, we analyze feature importance using the mRMR feature selection method and SHAP-based interpretability techniques. Based on empirical data from the AWA Data Exchange, we find that for predicting continuous prices, Word2Vec text representations, which capture semantic similarity, yield superior performance. In contrast, for price-tier classification, simpler representations that do not rely on semantic similarity, such as Bag-of-Words and TF-IDF, perform better. SHAP analysis reveals that semantic features related to healthcare and demographics tend to increase predicted prices, whereas those associated with weather and environmental topics are linked to lower prices. This analytical framework improves the interpretability of pricing models.
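To make the two-task setup concrete, the following is a minimal sketch, not the authors' pipeline, of two steps the abstract names: encoding product descriptions with TF-IDF and discretizing prices into ordered tiers. It uses only the Python standard library. The function names (`tokenize`, `tfidf_vectors`, `price_tiers`), the smoothed IDF formula, the rank-based tiering rule, and the four toy descriptions and prices are all assumptions made for illustration. The abstract does not specify how tiers are formed or which TF-IDF variant is used.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer that keeps alphabetic tokens only."""
    return [tok for tok in text.lower().split() if tok.isalpha()]


def tfidf_vectors(docs):
    """Return (vocab, vectors) with one TF-IDF vector per document.

    Assumed variant: relative term frequency times a smoothed IDF,
    idf(t) = ln((1 + n) / (1 + df(t))) + 1.
    """
    tokenized = [tokenize(doc) for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty descriptions
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors


def price_tiers(prices, n_tiers=3):
    """Assign ordered tiers 0..n_tiers-1 by price rank (near-equal tier sizes)."""
    order = sorted(range(len(prices)), key=lambda i: prices[i])
    tiers = [0] * len(prices)
    for rank, idx in enumerate(order):
        tiers[idx] = rank * n_tiers // len(prices)
    return tiers


# Invented toy data for illustration only; not taken from the study.
descriptions = [
    "hospital patient claims records",
    "daily weather station readings",
    "census demographics by region",
    "hourly weather radar images",
]
prices = [5000.0, 200.0, 3000.0, 150.0]

vocab, features = tfidf_vectors(descriptions)  # inputs for either task
tiers = price_tiers(prices)                    # targets for the classification task
```

In this sketch, `features` paired with `prices` would form the regression task and `features` paired with `tiers` the classification task. Any of the six model families listed in the abstract could then be fit to either pair.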