Khan, Muhammad Arsil (2025) Dynamic MLOps Pipeline and Retraining Strategy Optimization for LLMs in Multi-Cloud Environments for AI Deployments. Masters thesis, Dublin, National College of Ireland.
Files: PDF (Master of Science), 1MB; PDF (Configuration Manual), 730kB
Abstract
Enterprises increasingly rely on continuous delivery of machine learning models to power critical applications. However, retraining models too frequently consumes significant cloud resources, while infrequent updates can lead to performance degradation. This study presents an adaptive MLOps pipeline that uses reinforcement learning with Proximal Policy Optimization (PPO) to determine optimal retraining schedules for a DistilGPT-2 model deployed in a multi-cloud environment (AWS and Azure). The multi-cloud environment is used to balance GPU-hour costs, latency, and resource availability, ensuring both cost efficiency and low inference latency. The system collects live metrics (CPU and memory usage, latency, model accuracy, and others) via Prometheus and converts them into state vectors. The agent learns a policy from a reward function that weighs quality improvements against cost, so that retraining is triggered only when it delivers measurable benefits. In a four-hour experiment handling 200 requests per second, the adaptive pipeline reduced retraining events by 75%, increased the average BLEU-1 score by 0.15 points, and improved latency compared to fixed-interval baselines. These results demonstrate that PPO-based reinforcement learning can significantly reduce resource utilization while preserving or improving model performance. This study offers a practical framework for self-improving, cost-effective ML operations in multi-cloud environments.
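The state-vector and reward ideas described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the metric names in `METRIC_ORDER`, the `RewardWeights` defaults, and the linear form of `reward` are all assumptions chosen for illustration, and the PPO agent itself is omitted.

```python
from dataclasses import dataclass

# Hypothetical metric names. The abstract mentions CPU and memory usage,
# latency, and model accuracy, but does not give exact identifiers.
METRIC_ORDER = ("cpu_util", "mem_util", "latency_ms", "accuracy")


@dataclass(frozen=True)
class RewardWeights:
    quality: float = 1.0  # assumed weight on the quality improvement
    cost: float = 0.5     # assumed weight on the retraining cost


def to_state_vector(metrics: dict[str, float]) -> list[float]:
    """Arrange scraped metrics into a fixed-order vector for the agent."""
    return [float(metrics[name]) for name in METRIC_ORDER]


def reward(quality_gain: float, gpu_hours: float, retrained: bool,
           weights: RewardWeights = RewardWeights()) -> float:
    """Assumed linear reward: weighted quality gain minus weighted cost.

    The cost term applies only when a retraining event was triggered.
    """
    cost = gpu_hours if retrained else 0.0
    return weights.quality * quality_gain - weights.cost * cost
```

For instance, `to_state_vector({"cpu_util": 0.4, "mem_util": 0.6, "latency_ms": 120, "accuracy": 0.8})` yields a four-element vector in the order above. Under this assumed reward, skipping retraining costs nothing, so the agent is rewarded for retraining only when the quality gain outweighs the weighted GPU-hour cost.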
| Item Type: | Thesis (Masters) |
|---|---|
| Supervisors: | Makki, Ahmed (email: UNSPECIFIED) |
| Subjects: | Q Science > QH Natural history > QH301 Biology > Methods of research. Technique. Experimental biology > Data processing. Bioinformatics > Artificial intelligence; Q Science > Q Science (General) > Self-organizing systems. Conscious automata > Artificial intelligence; T Technology > T Technology (General) > Information Technology > Cloud computing; Q Science > Q Science (General) > Self-organizing systems. Conscious automata > Machine learning |
| Divisions: | School of Computing > Master of Science in Cloud Computing |
| Depositing User: | Ciara O'Brien |
| Date Deposited: | 26 Mar 2026 14:27 |
| Last Modified: | 26 Mar 2026 14:27 |
| URI: | https://norma.ncirl.ie/id/eprint/9227 |