NORMA eResearch @NCI Library

AI-Driven Cloud Optimization: Enhancing Cost Prediction, Resource Scheduling and Fault Resilience in Cloud Environments

Bhaskaran, Ranjith, Muntean, Cristina Hava and Gupta, Shaguna (2025) AI-Driven Cloud Optimization: Enhancing Cost Prediction, Resource Scheduling and Fault Resilience in Cloud Environments. In: 2025 IEEE International Conference on Cloud Computing Technology and Science (CloudCom). IEEE, Shenzhen, China. ISBN 979-833156634-0

Full text not available from this repository.
Official URL: https://doi.org/10.1109/CloudCom67567.2025.1133137...

Abstract

Cloud computing offers scalability and flexibility, yet it poses persistent challenges in cost estimation, efficient resource scheduling, and fault tolerance. This paper proposes an AI-driven framework that addresses these challenges by combining cost prediction, dynamic task scheduling, and fault detection behind a user-friendly visualization dashboard. Cost prediction uses supervised machine learning algorithms such as Linear Regression, Random Forest, and XGBoost to predict the cost of a task from synthetic workloads generated with iFogSim. Hyperparameter optimization with Optuna further improves prediction accuracy. Task scheduling employs Deep Reinforcement Learning (DRL) with a Deep Q-Network (DQN) architecture that optimizes job placement on heterogeneous virtual machines (VMs) and is benchmarked against First-Come-First-Serve (FCFS) and Round-Robin scheduling. The scheduling logic is trained and tested on the Kaggle Cloud Task Scheduling dataset. The fault detection mechanism uses the Isolation Forest algorithm to detect anomalous system behavior, such as abnormal CPU usage or unusually long execution times. Evaluation metrics, reward curves, anomaly plots, and interpretability graphs are displayed in a Streamlit-based dashboard hosted on Render. The framework is modular and automated, so each component can be run on demand, which makes it flexible, reproducible, and resilient in deployment. Experimental results show that the approach makes cost estimation more accurate, reduces scheduling delays, and increases fault tolerance. By combining predictive analytics, reinforcement learning, and anomaly detection to optimize operations in multi-cloud environments, the proposed framework is both holistic and practical. The outcomes of this research may be of interest for real-world cloud management applications.
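The two baseline schedulers named in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the `VM` class, the function names, the assumption that all tasks arrive at time zero, and the reading of FCFS as "next task goes to the VM that becomes free first" are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class VM:
    speed: float          # hypothetical capacity, in work units per second
    free_at: float = 0.0  # time at which this VM finishes its queued work


def fcfs(task_lengths, speeds):
    """Assign each task, in arrival order, to the VM that becomes free first."""
    vms = [VM(s) for s in speeds]
    finish_times = []
    for length in task_lengths:
        vm = min(vms, key=lambda v: v.free_at)  # ties go to the first VM
        vm.free_at += length / vm.speed
        finish_times.append(vm.free_at)
    return finish_times


def round_robin(task_lengths, speeds):
    """Assign tasks to VMs in a fixed cyclic order, ignoring current load."""
    vms = [VM(s) for s in speeds]
    finish_times = []
    for i, length in enumerate(task_lengths):
        vm = vms[i % len(vms)]
        vm.free_at += length / vm.speed
        finish_times.append(vm.free_at)
    return finish_times


def makespan(finish_times):
    """Time at which the last task completes."""
    return max(finish_times)


if __name__ == "__main__":
    tasks = [4, 4, 4, 4]   # illustrative task lengths
    speeds = [2.0, 1.0]    # one fast VM and one slow VM
    print("FCFS makespan:", makespan(fcfs(tasks, speeds)))
    print("Round-Robin makespan:", makespan(round_robin(tasks, speeds)))
```

On heterogeneous VMs, the cyclic policy keeps sending work to the slow machine, which is the kind of inefficiency a learned placement policy would aim to avoid.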

Item Type: Book Section
Uncontrolled Keywords: AI Scheduling; Cloud Optimization; Cloud Simulation; Cost Prediction; Fault Tolerance
Subjects: Q Science > QA Mathematics > Electronic computers. Computer science
T Technology > T Technology (General) > Information Technology > Electronic computers. Computer science
Q Science > QH Natural history > QH301 Biology > Methods of research. Technique. Experimental biology > Data processing. Bioinformatics > Artificial intelligence
Q Science > Q Science (General) > Self-organizing systems. Conscious automata > Artificial intelligence
T Technology > T Technology (General) > Information Technology > Cloud computing
Divisions: School of Computing > Staff Research and Publications
Depositing User: Tamara Malone
Date Deposited: 16 Apr 2026 14:02
Last Modified: 16 Apr 2026 14:02
URI: https://norma.ncirl.ie/id/eprint/9286
