In this paper, we implement a unique form of clustering that takes a non-numeric data set and clusters it with the help of the word embeddings provided by the GloVe dataset. An embedding is generated for each item in the data set to be clustered, using the GloVe vector representation of the corresponding words. We then perform dimensionality reduction on the data set to obtain an appropriate number of dimensions for cluster formation. The data is then clustered using k-means++. This paper provides one way to overcome the limitation of k-means clustering with respect to initialising the cluster centres, and hence gives better-quality clusters. On the synthetic examples, the k-means method does not perform well, because the random seeding inevitably merges clusters together, and the algorithm is then unable to split them apart. The careful seeding method used by k-means++ prevents this problem and hence usually gives near-optimal results even when the datasets are synthetic. © 2017 IEEE.
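The seeding step that distinguishes k-means++ from plain k-means can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names `squared_distance` and `kmeans_pp_seeds` are hypothetical, the toy 2-D points stand in for dimensionality-reduced word vectors, and the sketch assumes the data contains at least k distinct points.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_seeds(points, k, rng):
    """Pick k initial centres with k-means++ (D^2-weighted) seeding.

    Assumes `points` holds at least k distinct vectors (hypothetical helper).
    """
    # First centre: chosen uniformly at random.
    centres = [points[rng.randrange(len(points))]]
    while len(centres) < k:
        # Weight of each point: squared distance to its nearest chosen centre.
        weights = [min(squared_distance(p, c) for c in centres) for p in points]
        # Points far from every existing centre are more likely to be drawn;
        # already chosen points have weight zero and are not drawn again.
        centres.append(rng.choices(points, weights=weights, k=1)[0])
    return centres


# Toy stand-in for reduced embeddings: three well-separated groups.
toy_points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0), (10.1, 0.0)]
seeds = kmeans_pp_seeds(toy_points, 3, random.Random(0))
```

Because each new centre is drawn with probability proportional to its squared distance from the centres already chosen, two seeds are unlikely to land in the same tight group, which is the failure mode of uniform random seeding described above.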

Balakrushna Tripathy

Department of Analytics

School of Computer Science and Engineering

Vellore Campus

A Gupta

Vellore Institute of Technology (VIT) is a private university located in Tamil Nadu, India. Founded in 1984 as Vellore Engineering College, the institution offers 20 undergraduate, 34 postgraduate, four integrated and four research programs. It has campuses in Vellore, Amravati, Bhopal and Chennai.

VIT is one of the top-ranked private universities in India according to the NIRF, THE and QS rankings. The Government of India has recognized VIT, Vellore as an Institution of Eminence. This has allowed VIT to take independent quality initiatives and move up in the world rankings.

VIT University

2017 International Conference on Intelligent Sustainable Systems (ICISS)

Data clustering plays a very important role in data mining, machine learning and image processing. As modern databases have inherent uncertainties, many uncertainty-based data clustering algorithms have been developed. These include fuzzy c-means, rough c-means and intuitionistic fuzzy c-means, as well as algorithms based on hybrid models, such as rough fuzzy c-means and rough intuitionistic fuzzy c-means. There are also many variants that improve these algorithms in different directions, such as their kernelised, possibilistic, and possibilistic kernelised versions. However, none of the above algorithms is effective on big data, for various reasons. Researchers have therefore been trying for the past few years to improve these algorithms so that they can be applied to cluster big data. Such algorithms are relatively few in comparison to those for datasets of reasonable size. Our aim in this chapter is to present the uncertainty-based clustering algorithms developed so far and to propose a few new algorithms that can be developed further.

Modern Technologies for Big Data Classification and Clustering Advances in Data Mining and Database Management

Uncertainty-Based Clustering Algorithms for Large Data Sets

Divide and conquer techniques have long been used to solve complex, time-consuming problems. Here we evaluate the performance of centroid-based clustering techniques, specifically k-means and its two approximation algorithms, k-means++ and k-means|| (also known as Scalable k-means++), as divide and conquer paradigms applied to the creation of minimum spanning trees. The algorithms are run on different datasets to obtain a sound evaluation of their respective performances. This is a continuation of our previous work on developing the KMST+ algorithm in the context of fast minimum spanning tree (FMST) frameworks. © 2017 IEEE.

2017 Third International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN)

Comparison of centroid-based clustering algorithms in the context of divide and conquer paradigm based FMST framework

Clustering is categorised as hard or soft in nature. Soft clusters may have fuzzy or rough boundaries. Rough clustering can help researchers to discover overlapping clusters in many applications such as web mining and text mining. The rough set approach is a very useful tool for handling unclear and ambiguous data. However, because rough sets rely on an equivalence relation, they remain rigid, and they are unreliable and inefficient for real-time applications where the datasets may be very large. In this paper, we provide a solution to this problem with a covering-based rough set approach. The covering-based rough set is an extension of the rough set approach in which the equivalence relation has been relaxed. The method is based on coverings rather than partitions, which makes it more flexible than rough sets and more convenient for dealing with complex applications. Clustering sequential data is one of the vital research tasks. We use a covering-based similarity measure, which gives better results than the rough set approach, which uses set and sequence similarity measures. In this paper, a covering-based rough fuzzy set clustering approach is proposed to resolve the uncertainty of sequence data. © 2015 Inderscience Enterprises Ltd.

International Journal of Reasoning-based Intelligent Systems

An integrated covering-based rough fuzzy set clustering approach for sequential data

Clustering has played an important role in data analysis systems from their beginning. The earliest clustering algorithms could handle only numerical data; K-means clustering, proposed by MacQueen [1] in 1967, is one of them. This algorithm helps us to find the homogeneity of the data set. The K-means algorithm has been modified in many ways, and kernel-based K-means is one such modification. It applies a nonlinear transformation that maps the sample data into a high-dimensional feature space. Although kernel-based K-means performs well on almost every data set, it is unable to handle uncertainty. Since rough set theory was proposed by Pawlak [2], many clustering algorithms based on it have been developed that can handle uncertainty and heterogeneous data, and rough K-means is one of them. In this paper we therefore propose a combination of these two methods, called kernel-based K-means using rough set. © 2012 IEEE.
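As an illustration of the kernel idea mentioned above, the squared distance between a point and a cluster mean in feature space can be computed from kernel values alone, without ever constructing the high-dimensional mapping. The sketch below uses the standard expansion K(x, x) - (2/n) Σ K(x, p) + (1/n²) Σ K(p, q); the function names and the `gamma` parameter are illustrative assumptions, and nothing here reproduces the paper's rough-set extension.

```python
import math


def linear_kernel(a, b):
    """Plain dot product; feature space equals the input space."""
    return sum(x * y for x, y in zip(a, b))


def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel; gamma is an illustrative bandwidth parameter."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))


def feature_space_sq_distance(x, cluster, kernel):
    """Squared distance from phi(x) to the mean of phi(p) over the cluster.

    Uses only kernel evaluations (hypothetical helper, non-empty cluster assumed).
    """
    n = len(cluster)
    self_term = kernel(x, x)
    cross_term = sum(kernel(x, p) for p in cluster) / n
    within_term = sum(kernel(p, q) for p in cluster for q in cluster) / (n * n)
    return self_term - 2.0 * cross_term + within_term


# With the linear kernel this reduces to the ordinary squared distance
# between x and the cluster centroid, here (1, 0).
d_linear = feature_space_sq_distance((1.0, 1.0), [(0.0, 0.0), (2.0, 0.0)], linear_kernel)
```

A kernel K-means loop would assign each point to the cluster that minimises this quantity; swapping `linear_kernel` for `rbf_kernel` changes the geometry without changing the assignment code.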

2012 International Conference on Computer Communication and Informatics

Kernel based K-means clustering using rough set

Several cluster analysis techniques have been developed to group objects having similar properties or characteristics, and K-means clustering, proposed by MacQueen [12] in 1967, is one of the most popular statistical clustering techniques. However, this algorithm is unable to handle categorical data or uncertainty. Rough set theory, proposed by Pawlak [15], offers an alternative way of representing sets whose exact boundary cannot be described due to incomplete information. As rough sets have been widely used for knowledge representation, they can also be applied in classification and are very helpful in clustering. In real-life data mining applications, clusters do not have crisp boundaries. In 2007 and 2009, Parmar et al. [14] and Tripathy et al. [16] proposed two algorithms based on rough set theory, MMR and MMeR respectively, but both suffer from a stability problem due to multiple runs and from higher time complexity. In this paper we propose a new k-means approach using rough sets, which can handle heterogeneous data as well as uncertainty. © Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2012.
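The idea of a set without a crisp boundary can be made concrete with Pawlak's lower and upper approximations. The sketch below is a minimal illustration, not the clustering algorithm proposed in the paper: the function name `approximations` and the toy data are hypothetical, and the equivalence classes are assumed to form a partition of the universe.

```python
def approximations(equivalence_classes, target):
    """Return the (lower, upper) approximations of `target`.

    `equivalence_classes` is assumed to be a partition of the universe
    into sets of indiscernible objects (hypothetical helper).
    """
    target = set(target)
    lower, upper = set(), set()
    for block in equivalence_classes:
        block = set(block)
        if block <= target:
            # Every object in the block certainly belongs to the target.
            lower |= block
        if block & target:
            # At least one object in the block possibly belongs to the target.
            upper |= block
    return lower, upper


# Objects 3 and 4 are indiscernible, but only 3 is in the target set,
# so both end up in the boundary region (upper minus lower).
lower, upper = approximations([{1, 2}, {3, 4}, {5}], {1, 2, 3})
boundary = upper - lower
```

Objects in the lower approximation belong to the set with certainty, while those in the boundary may or may not belong, which is how rough-set-based clustering represents overlapping or uncertain cluster membership.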

Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering Advances in Computer Science and Information Technology. Networks and Communications

Journal	2017 International Conference on Intelligent Sustainable Systems (ICISS)
Publisher	IEEE
Open Access	0