Clustering of mixed datasets using deep learning algorithm

Balaji K.; Lavanya K.; Geetha Mary A

doi:10.1016/j.chemolab.2020.104123

The performance of a clustering algorithm is highly dependent on the quality and quantity of the training dataset. Deep learning is one of the most popular and successful techniques for clustering datasets with high quality. Most datasets contain a mix of numeric and categorical attributes, and clustering such heterogeneous data is a complex problem. Deep learning methods, the state-of-the-art classifiers, with better learning procedures and computational resources, can fill these gaps. To improve the robustness of clusters, we propose a Constraint-Based Deep Convolutional Generative Adversarial Network (CB-DCGANs) framework that generates simulated data to augment the training set and thereby improve the performance of the clustering algorithm. We evaluated the performance of an end-to-end Deep Convolutional Neural Network (DCNN) in detecting the clusters in the given datasets. CB-DCGANs with DCNN yielded a baseline accuracy of 0.8853 on the heart disease dataset. On chemoinformatics datasets, the proposed algorithm yielded accuracies of 0.965 for the Kaggle dataset, 0.987 for the factors dataset, and 0.952 for the kinase dataset. This study shows that using generative adversarial networks for clustering augmentation can significantly improve performance, especially in real-life applications.
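The abstract does not say how mixed numeric and categorical attributes are turned into network input. As an illustration only, and not the authors' method, the sketch below shows one common way to map such records to fixed-length numeric vectors: min-max scaling for numeric columns and one-hot encoding for categorical ones. The function name `encode_mixed`, the column names, and the toy records are all hypothetical.

```python
from typing import Dict, List, Sequence, Union

Value = Union[float, str]


def encode_mixed(rows: Sequence[Dict[str, Value]],
                 numeric: Sequence[str],
                 categorical: Sequence[str]) -> List[List[float]]:
    """Min-max scale numeric columns and one-hot encode categorical ones."""
    # Observed (min, max) range of each numeric attribute.
    ranges = {}
    for col in numeric:
        vals = [float(r[col]) for r in rows]
        ranges[col] = (min(vals), max(vals))

    # Sorted vocabulary of observed categories for each categorical attribute.
    vocab = {col: sorted({str(r[col]) for r in rows}) for col in categorical}

    encoded = []
    for r in rows:
        vec: List[float] = []
        for col in numeric:
            lo, hi = ranges[col]
            # A constant column carries no information; map it to 0.0.
            vec.append(0.0 if hi == lo else (float(r[col]) - lo) / (hi - lo))
        for col in categorical:
            vec.extend(1.0 if str(r[col]) == c else 0.0 for c in vocab[col])
        encoded.append(vec)
    return encoded


# Hypothetical toy records, not taken from any dataset used in the paper.
rows = [
    {"age": 40, "chol": 200, "chest_pain": "typical"},
    {"age": 60, "chol": 300, "chest_pain": "atypical"},
    {"age": 50, "chol": 250, "chest_pain": "typical"},
]
vectors = encode_mixed(rows, numeric=["age", "chol"], categorical=["chest_pain"])
```

Each record becomes a vector of two scaled numeric values followed by one indicator per observed category, so every attribute lies in the range 0 to 1 regardless of its original type.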

Journal	Chemometrics and Intelligent Laboratory Systems
Publisher	Elsevier BV
Open Access	No