An abundance of information about cancer exists online, but categorizing it and extracting useful information from it is difficult. Almost all research in healthcare data processing is concerned with formal clinical data, but non-clinical data contains valuable information too. The present study combines methods from distributed computing, text retrieval, clustering, and classification into a coherent and computationally efficient system that can clarify cancer patient trajectories based on non-clinical, freely available information. We produce a fully functional prototype that can retrieve, cluster, and present information about cancer trajectories from non-clinical forum posts. We evaluate three clustering algorithms (MR-DBSCAN, DBSCAN, and HDBSCAN) and compare them in terms of Adjusted Rand Index and total run time as functions of the number of posts retrieved and the neighborhood radius. The results show that the neighborhood radius has the most significant impact on clustering performance. For small values, the data set is split accordingly, but high values produce a large number of possible partitions, and searching for the best partition is therefore time-consuming. With a properly estimated radius, MR-DBSCAN can cluster 50,000 forum posts in 46.1 seconds, compared to 143.4 seconds for DBSCAN and 282.3 seconds for HDBSCAN. We conduct an interview with the Danish Cancer Society and present our software prototype. The organization sees potential in software that can democratize online information about cancer and foresees that such systems will be required in the future.
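To make the evaluation metric concrete, the following is a minimal sketch of how the Adjusted Rand Index responds to the neighborhood radius. It is a toy illustration under stated assumptions, not the pipeline used in the study: the function names `adjusted_rand_index` and `dbscan_1d`, the one-dimensional points, the reference labels, and the radius values are all hypothetical, and a plain quadratic-time DBSCAN stands in for the three algorithms compared above (MR-DBSCAN and HDBSCAN are not shown).

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two flat labelings of the same items."""
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("labelings must have the same length")
    # Count item pairs that share a contingency cell, a true class, or a predicted cluster.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0
    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both labelings are trivial
    return (sum_cells - expected) / (max_index - expected)


def dbscan_1d(points, eps, min_pts):
    """Tiny O(n^2) DBSCAN on 1-D points; returns labels, with -1 meaning noise."""
    n = len(points)
    # A point's neighborhood includes the point itself.
    neighbors = [[j for j in range(n) if abs(points[i] - points[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point joins as border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # only core points expand the cluster
    return labels


# Hypothetical toy data: two well-separated groups of four points each.
points = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3]
truth = [0, 0, 0, 0, 1, 1, 1, 1]

scores = {}
for eps in (0.05, 0.15, 10.0):
    scores[eps] = adjusted_rand_index(truth, dbscan_1d(points, eps, min_pts=2))
    print(f"eps={eps}: ARI={scores[eps]:.2f}")
```

On this toy data, the middle radius recovers both groups (ARI 1.0), the smallest radius marks every point as noise, and the largest merges everything into one cluster (ARI 0.0 in both cases). Real forum-post embeddings are high-dimensional and far noisier, so the sketch only shows why the score is sensitive to the radius, not how large the effect is in practice.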