CLUST - Grouping aware data placement for improving the performance of large-scale data management system

Vengadeswaran Shanmugasundaram; S.R. Balasundaram

doi:10.1145/3371158.3371159

Published in Association for Computing Machinery

2020

Pages: 1 - 9

Abstract

Most applications today are data-intensive and must process large data sets across a cluster of nodes. Hadoop, an open-source implementation of the MapReduce (MR) architecture, has become the de facto processing platform for these applications. Although Hadoop is considered an ideal solution for analysing and gaining insights from massive data, it has limitations when the data to be processed exhibits interest-locality (i.e. the data required for query execution follows a grouping behaviour in which only part of the big data is accessed frequently). Because Hadoop's data placement does not consider interest-locality, the dependent blocks required for execution may be concentrated on a few computing nodes, resulting in severe degradation of MR performance. This paper therefore proposes CLUST, an optimal data placement strategy based on grouping semantics, so that queries can be completed sooner. It applies hierarchical agglomerative clustering techniques to data placement to improve MR performance during the execution of interest-based queries. The approach has been validated by executing complex interest-based queries on the NCDC weather dataset, distributed across two scalable heterogeneous Hadoop clusters deployed on the cloud. CLUST significantly reduces execution time and improves data locality and CPU utilisation, proving to be an efficient solution for big data processing. © 2020 Association for Computing Machinery.
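The grouping idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the distance measure, linkage, or placement rule, so the Jaccard distance, average linkage, merge threshold, and round-robin spreading used here are all assumptions, and the names `group_blocks`, `place_groups`, and `access_log` are hypothetical. The sketch clusters blocks that are read by the same queries, then spreads each group across nodes so that dependent blocks are not concentrated on a few machines.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b| for two sets of query ids."""
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)


def group_blocks(access_log, threshold=0.5):
    """Average-linkage agglomerative clustering of blocks (assumed scheme).

    access_log maps a query id to the set of block ids that query read.
    The closest pair of clusters is merged repeatedly until the smallest
    average distance exceeds `threshold`.
    """
    # Invert the log: block id -> set of queries that touched the block.
    touched = {}
    for query, blocks in access_log.items():
        for block in blocks:
            touched.setdefault(block, set()).add(query)

    # Start with one singleton cluster per block.
    clusters = [[block] for block in sorted(touched)]

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters.
        total = sum(jaccard_distance(touched[x], touched[y])
                    for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        if linkage(clusters[i], clusters[j]) > threshold:
            break
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(clusters)


def place_groups(groups, nodes):
    """Assign blocks to nodes round-robin, walking the groups in order.

    Consecutive blocks of a group land on consecutive nodes, so a group
    no larger than the cluster never has two blocks on the same node.
    """
    placement = {node: [] for node in nodes}
    slot = 0
    for group in groups:
        for block in group:
            placement[nodes[slot % len(nodes)]].append(block)
            slot += 1
    return placement


if __name__ == "__main__":
    # Hypothetical access log: two interest groups of blocks.
    access_log = {
        "q1": {"b1", "b2", "b3"},
        "q2": {"b1", "b2", "b3"},
        "q3": {"b4", "b5"},
        "q4": {"b4", "b5", "b6"},
    }
    groups = group_blocks(access_log)
    print(groups)
    print(place_groups(groups, ["n1", "n2", "n3"]))
```

With this toy log, a query interested in one group finds its blocks on three different nodes, so its map tasks can run in parallel with local data. A real deployment would also have to account for replication and node heterogeneity, which this sketch ignores.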