Enhancing Water Use Data Analysis in Cloud Computing Environments through Parallel Processing Optimization

Joseph, Priyanka

Joseph, Priyanka (2023) Enhancing Water Use Data Analysis in Cloud Computing Environments through Parallel Processing Optimization. Masters thesis, Dublin, National College of Ireland.

Abstract

Water quality analysis is crucial for public health, ecological balance, and sustainable development. This research performs a water quality assessment using the National Water Quality Monitoring Programme (NWMP) dataset from India's Central Pollution Control Board (CPCB), with the aim of supporting the availability of safe drinking water and the preservation of water resources for ecological well-being. Its significance lies in using predictive modeling and analysis to generate information that policymakers and scientists can use for data-driven decisions about water resources. The study uses PySpark, the Python API for Apache Spark, and machine learning algorithms on Amazon Web Services (AWS) SageMaker to process the large-scale dataset. The primary objectives are to handle the large dataset effectively, incorporate advanced analytics, and provide actionable insights for water resource management. The methodology contributes to the research by integrating PySpark for distributed data processing, applying linear and logistic regression models from Spark's machine learning library (MLlib) for predictive modeling, and using AWS Simple Storage Service (S3) for storage and AWS Glue for serverless data integration. The study analyzes relationships among parameters such as dissolved oxygen, pH, and conductivity to estimate the Water Quality Index (WQI), an indicator of how suitable water is for consumption. The linear regression model achieved an accuracy of about 97% in its predictions on the dataset, while the logistic regression model performed slightly better, classifying water quality into multiple categories (Poor, Good, Unsuitable) with an accuracy of 99%. The findings are intended to help policymakers, water managers, and scientists make informed decisions about sustainable water resource management. Overall, this research demonstrates a scalable, cloud-based approach that combines PySpark, machine learning, and AWS for efficient large-scale water quality analysis.
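The abstract does not say how the WQI is computed. As a rough illustration only, the sketch below uses the widely cited weighted arithmetic index, in which each parameter gets a quality rating relative to an ideal value and a permissible limit, and weights are inversely proportional to the limit. The formula choice, the `Parameter` class, the function names, the column names, and the ideal and limit values are all assumptions made for this example and are not taken from the thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Parameter:
    """One water quality parameter (hypothetical helper for this sketch)."""
    name: str
    ideal: float     # assumed value in pristine water
    standard: float  # assumed permissible limit


# Illustrative values only; NOT the limits used in the thesis.
PARAMETERS = [
    Parameter("dissolved_oxygen_mg_l", ideal=14.6, standard=5.0),
    Parameter("ph", ideal=7.0, standard=8.5),
    Parameter("conductivity_us_cm", ideal=0.0, standard=300.0),
]


def quality_rating(value: float, p: Parameter) -> float:
    """Rating q = 100 * (V - V_ideal) / (S - V_ideal): 0 at ideal, 100 at the limit."""
    return 100.0 * (value - p.ideal) / (p.standard - p.ideal)


def water_quality_index(sample: dict, parameters=PARAMETERS) -> float:
    """Weighted arithmetic WQI with weights proportional to 1 / standard."""
    weights = [1.0 / p.standard for p in parameters]
    ratings = [quality_rating(sample[p.name], p) for p in parameters]
    return sum(w * q for w, q in zip(weights, ratings)) / sum(weights)


if __name__ == "__main__":
    # Made-up sample, used only to show the call.
    sample = {"dissolved_oxygen_mg_l": 6.8, "ph": 7.6, "conductivity_us_cm": 210.0}
    print(round(water_quality_index(sample), 2))
```

In this formulation a lower index means better water: a sample at the ideal values scores 0 and a sample exactly at every permissible limit scores 100. In a PySpark workflow such as the one the abstract describes, a function like this could plausibly be expressed as column arithmetic on a DataFrame, but that mapping is likewise an assumption.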

Item Type:	Thesis (Masters)
Supervisors:	Makki, Ahmed
Uncontrolled Keywords:	Water quality; CDWR; machine learning; PySpark; AWS
Subjects:	G Geography. Anthropology. Recreation > GE Environmental Sciences; Q Science > QA Mathematics > Electronic computers. Computer science; T Technology > T Technology (General) > Information Technology > Electronic computers. Computer science; T Technology > T Technology (General) > Information Technology > Cloud computing; Q Science > Q Science (General) > Self-organizing systems. Conscious automata > Machine learning
Divisions:	School of Computing > Master of Science in Cloud Computing
Depositing User:	Ciara O'Brien
Date Deposited:	28 Mar 2025 13:48
Last Modified:	06 May 2025 16:54
URI:	https://norma.ncirl.ie/id/eprint/7348
