NORMA eResearch @NCI Library

ML-Powered Cloud Task Failure Prediction and Scalable Deployment on AWS

Satheeshkumar, Harikrishnan (2025) ML-Powered Cloud Task Failure Prediction and Scalable Deployment on AWS. Masters thesis, Dublin, National College of Ireland.

PDF (Master of Science), 681kB
PDF (Configuration Manual), 336kB

Abstract

Predicting task failures in large-scale cloud environments is critical for improving system reliability and managing Service Level Agreements. This project presents an end-to-end machine learning system for predicting cloud task failures using historical workload data from Google Borg traces. A Random Forest Classifier was trained to distinguish between failing and succeeding tasks based on features such as resource requests, memory usage, and CPU cycles. The resulting model was operationalized by building a Flask-based REST API that dynamically loads the model from Amazon S3. For deployment, the application was containerized using Docker and orchestrated on AWS using ECS Fargate, ensuring a serverless and scalable execution environment for the prediction service. The system's endpoint is exposed via an Application Load Balancer, with an attached Auto Scaling policy based on CPU utilization to handle variable prediction request loads, ensuring the API itself remains responsive. This work demonstrates a complete, production-ready pipeline for deploying a real-time, scalable, ML-powered classification service in a cloud-native fashion.
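
The request path described in the abstract (load the model once, validate the incoming features, return a fail/success label) can be sketched roughly as follows. This is a minimal illustration and not the thesis implementation: the feature names, the `ThresholdModel` stand-in, and the `handle_predict` helper are all hypothetical, and the Amazon S3 download and Flask routing are replaced by plain Python so the sketch runs on its own.

```python
import json
from functools import lru_cache

# Hypothetical feature names. The abstract only mentions "resource requests,
# memory usage, and CPU cycles"; the real column names are not given.
FEATURES = ("cpu_request", "memory_request", "memory_usage", "cpu_cycles")


class ThresholdModel:
    """Stand-in for the trained Random Forest. NOT the thesis model."""

    def predict(self, rows):
        # Toy rule: label a task as failing (1) when its memory usage
        # exceeds its memory request, otherwise succeeding (0).
        return [1 if row[2] > row[1] else 0 for row in rows]


@lru_cache(maxsize=1)
def load_model():
    # In the described system this step would fetch a serialized model from
    # Amazon S3. Here it returns the stand-in, cached so it is built once.
    return ThresholdModel()


def handle_predict(body: str):
    """Return (status_code, response_dict) for a JSON request body."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return 400, {"error": "body must be valid JSON"}
    if not isinstance(payload, dict):
        return 400, {"error": "body must be a JSON object"}

    missing = [name for name in FEATURES if name not in payload]
    if missing:
        return 400, {"error": f"missing features: {missing}"}

    try:
        row = [float(payload[name]) for name in FEATURES]
    except (TypeError, ValueError):
        return 400, {"error": "features must be numeric"}

    label = load_model().predict([row])[0]
    return 200, {"prediction": "fail" if label == 1 else "success"}
```

In a Flask application, a function like this would typically sit behind a single POST route, with the cached loader ensuring that each container pays the model download cost only once.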

Item Type: Thesis (Masters)
Supervisors: Kazmi, Aqeel (email: UNSPECIFIED)
Uncontrolled Keywords: Machine Learning; Failure Prediction; Cloud Computing; AWS; ECS Fargate; Docker; Random Forest; Auto Scaling; REST API; Google Borg Dataset
Subjects: Q Science > QA Mathematics > Electronic computers. Computer science
T Technology > T Technology (General) > Information Technology > Electronic computers. Computer science
T Technology > T Technology (General) > Information Technology > Cloud computing
Q Science > Q Science (General) > Self-organizing systems. Conscious automata > Machine learning
Divisions: School of Computing > Master of Science in Cloud Computing
Depositing User: Ciara O'Brien
Date Deposited: 30 Mar 2026 14:11
Last Modified: 30 Mar 2026 14:11
URI: https://norma.ncirl.ie/id/eprint/9258
