NORMA eResearch @NCI Library

Performance Based Data-Distribution Methodology In Heterogeneous Hadoop Environment

Ubarhande, Vrushali (2014) Performance Based Data-Distribution Methodology In Heterogeneous Hadoop Environment. Masters thesis, Dublin, National College of Ireland.



Hadoop has been developed to process data-intensive applications. However, current data-distribution methodologies are inefficient in heterogeneous environments such as cloud computing. The performance of Hadoop may degrade in a heterogeneous environment whenever data distribution does not match the computing capability of the nodes. In this work, existing research methodologies have been critically evaluated to understand the data-distribution techniques developed to date.

In the Hadoop framework, users specify the application's computation logic in terms of a map and a reduce function; such programs are often termed MapReduce applications. The Hadoop distributed file system stores MapReduce application data on the Hadoop cluster nodes, called Datanodes, while the Namenode acts as the control point for all Datanodes.
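The map/reduce model described above can be illustrated with a minimal, Hadoop-free simulation. This is not the thesis's code: the function names (`run_mapreduce`, `wc_map`, `wc_reduce`) are illustrative, and the example shows only the logical flow of map, shuffle, and reduce on the classic word-count problem.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate the MapReduce flow: map each record to (key, value)
    pairs, group ("shuffle") values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Classic word-count example: map emits (word, 1), reduce sums counts.
def wc_map(line):
    return [(word, 1) for word in line.split()]

def wc_reduce(word, counts):
    return sum(counts)

counts = run_mapreduce(["big data", "big cluster"], wc_map, wc_reduce)
# counts == {"big": 2, "data": 1, "cluster": 1}
```

In real Hadoop, the map and reduce functions run on many Datanodes in parallel and the shuffle moves data across the network; the logic per key, however, is the same.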

The concept of data-locality and its impact on the performance of Hadoop are discussed. Data distribution is a key factor in Hadoop because it affects task-scheduling performance in the Map phase. Task-scheduling techniques in Hadoop treat data-locality as a key factor in enhancing performance. Various task-scheduling techniques have been analysed to understand the positive effect of emphasising high data-locality during scheduling. Other system factors also play a major role in achieving high performance in Hadoop data processing.
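The locality preference described above can be sketched in a few lines. This is a simplification, not Hadoop's actual scheduler: the helper `pick_task` and the `block_locations` mapping are hypothetical, and the sketch only shows the core idea that a free node should first be given a task whose input block it already holds.

```python
def pick_task(pending_tasks, free_node, block_locations):
    """Prefer a task whose input block is stored on the free node
    (a data-local task); otherwise fall back to any pending task.
    `block_locations` maps each task to the set of nodes holding
    a replica of its input block."""
    for task in pending_tasks:
        if free_node in block_locations[task]:
            return task  # data-local: no network transfer of input needed
    return pending_tasks[0] if pending_tasks else None

locations = {"t1": {"nodeA"}, "t2": {"nodeB"}, "t3": {"nodeB", "nodeC"}}
chosen = pick_task(["t1", "t2", "t3"], "nodeB", locations)
# chosen == "t2": the first pending task whose block lives on nodeB
```

This is why the distribution of blocks matters: if slow nodes hold many blocks, schedulers either assign them local tasks they cannot finish quickly or assign remote tasks that pay a network cost.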

The main contribution of this work is to demonstrate a performance increase in Hadoop through the effective distribution of data in a heterogeneous environment. An experiment is proposed that adopts a novel data-placement strategy based on each Datanode's capability. In this experiment, a Speed Analyser component is created to measure the processing capability of each Datanode; the Speed Analyser calculates a computing ratio for each Datanode based on its response time. The Data-Distribution Technique is integrated with traditional Hadoop and uses the calculated computing ratios: based on these ratios, the Namenode decides how data blocks are assigned to the Datanodes.
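A minimal sketch of how such a computing ratio and block assignment could work, assuming the ratio is derived from benchmark response times (lower time means a faster node) and blocks are handed out in proportion to it. The function names, the inverse-time formula, and the sample response times are illustrative assumptions, not the thesis's actual implementation.

```python
def computing_ratios(response_times):
    """Derive a normalised computing ratio per node: faster nodes
    (lower benchmark response time) get proportionally higher ratios.
    Ratios sum to 1."""
    speeds = {node: 1.0 / t for node, t in response_times.items()}
    total = sum(speeds.values())
    return {node: s / total for node, s in speeds.items()}

def assign_blocks(num_blocks, ratios):
    """Assign whole data blocks to nodes in proportion to their ratios,
    giving any rounding remainder to the fastest nodes first."""
    counts = {node: int(num_blocks * r) for node, r in ratios.items()}
    remainder = num_blocks - sum(counts.values())
    for node in sorted(ratios, key=ratios.get, reverse=True)[:remainder]:
        counts[node] += 1
    return counts

# Hypothetical benchmark: per-node response times in seconds.
ratios = computing_ratios({"dn1": 2.0, "dn2": 4.0, "dn3": 4.0})
blocks = assign_blocks(100, ratios)
# dn1 is twice as fast, so it receives half the blocks:
# blocks == {"dn1": 50, "dn2": 25, "dn3": 25}
```

The design choice is that block placement, decided by the Namenode, already reflects node capability, so a locality-preferring scheduler then naturally routes more work to faster nodes.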

Thereafter, two MapReduce applications were executed to assess the performance improvement after the implementation of the proposed Data-Distribution Technique.
Finally, the future scope for improving the proposed solution is identified.

Item Type: Thesis (Masters)
Subjects: Q Science > QA Mathematics > Electronic computers. Computer science
T Technology > T Technology (General) > Information Technology > Electronic computers. Computer science
Divisions: School of Computing > Master of Science in Cloud Computing
Depositing User: Caoimhe Ní Mhaicín
Date Deposited: 12 Dec 2014 11:37
Last Modified: 12 Dec 2014 11:38
