NORMA eResearch @NCI Library

Open-source ETL Framework using Big Data tools Orchestration on AWS Cloud Platform

Sahoo, Sumit Kumar (2023) Open-source ETL Framework using Big Data tools Orchestration on AWS Cloud Platform. Masters thesis, Dublin, National College of Ireland.


Abstract

The purpose of this study is to provide a comprehensive review of available open-source ETL (Extract, Transform, Load) frameworks that can be used to support big data platforms on the Amazon Web Services (AWS) cloud platform. The specific objectives of this review are to (1) identify and evaluate the features of popular open-source ETL frameworks, (2) compare the features of these frameworks with those of commercial ETL tools, and (3) recommend the best frameworks for building and orchestrating big data platforms on AWS, taking infrastructure cost into account. A review of the literature was conducted to identify the most popular open-source ETL frameworks. The frameworks identified include Apache NiFi, Apache Beam, Apache Kafka, and Apache Airflow with PySpark. These frameworks were evaluated against several criteria, including ease of use, support for multiple data sources and formats, support for multiple data processing engines, and support for cloud-based deployment. Based on the results of the evaluation, it is concluded that infrastructure costs can be reduced with a refined cloud solution architecture, which was improved over the course of development, and that an open-source ETL big data framework can be developed using Apache Airflow, Hive, and PySpark. Terraform (Infrastructure as Code) is used to automate the entire cloud infrastructure, making the AWS resources reproducible and easy to change.

Item Type: Thesis (Masters)
Supervisors: Heeney, Sean (Email: UNSPECIFIED)
Uncontrolled Keywords: ETL; Opensource; Big Data; AWS orchestration; Cloud Cost Optimization
Subjects: Q Science > QA Mathematics > Electronic computers. Computer science
T Technology > T Technology (General) > Information Technology > Electronic computers. Computer science
T Technology > T Technology (General) > Information Technology > Cloud computing
Divisions: School of Computing > Master of Science in Cloud Computing
Depositing User: Tamara Malone
Date Deposited: 19 Apr 2023 13:45
Last Modified: 19 Apr 2023 13:45
URI: https://norma.ncirl.ie/id/eprint/6486
