Velamuri, Vidya Sankar (2016) Supervised Unsupervised Learning in Spark. Masters thesis, Dublin, National College of Ireland.
Abstract
Clustering is the unsupervised classification of patterns into groups, and it is one of the most popular techniques for exploring and discovering naturally occurring patterns in previously unlabelled data. The quality of the clusters produced by a clustering algorithm can be assessed using cluster validity indices, which take into account intra-cluster similarity and inter-cluster separation. However, in a distributed setting, computing the pairwise distances between the data points of a large data set spread across the nodes of a compute cluster can be very expensive.
This research proposes to evaluate a sampling-based approach to computing cluster validity indices for distributed datasets, and to embed this methodology in a model selection pipeline that evaluates distributed machine learning jobs in order to select an optimal clustering algorithm. The results suggest that the sampling error of the internal validation index computed in this way is statistically significant.
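The general idea of estimating an internal validity index from a sample can be sketched as follows. This is a minimal illustration only, not the thesis's implementation: the abstract does not name the index or the sampling scheme, so the sketch assumes the mean silhouette coefficient with Euclidean distance and simple random sampling without replacement, and it runs in plain Python rather than Spark. The names `silhouette`, `sampled_silhouette`, and `make_blobs` are hypothetical helpers introduced for this example.

```python
import math
import random


def silhouette(points, labels):
    """Mean silhouette coefficient (Euclidean). Needs O(n^2) pairwise distances."""
    clusters = set(labels)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    n = len(points)
    scores = []
    for i in range(n):
        # Accumulate distances from point i to all other points, per cluster.
        sums = {c: 0.0 for c in clusters}
        counts = {c: 0 for c in clusters}
        for j in range(n):
            if i != j:
                sums[labels[j]] += math.dist(points[i], points[j])
                counts[labels[j]] += 1
        own = labels[i]
        if counts[own] == 0:
            scores.append(0.0)  # singleton cluster: score defined as 0
            continue
        a = sums[own] / counts[own]  # mean intra-cluster distance
        b = min(sums[c] / counts[c] for c in clusters if c != own)  # nearest other cluster
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / n


def sampled_silhouette(points, labels, sample_size, seed=0):
    """Estimate the index on a simple random sample (assumed sampling scheme).

    Raises ValueError if the sample happens to contain fewer than two clusters.
    """
    rng = random.Random(seed)
    idx = rng.sample(range(len(points)), min(sample_size, len(points)))
    return silhouette([points[i] for i in idx], [labels[i] for i in idx])


def make_blobs(n_per_cluster, seed=0):
    """Synthetic demo data: two well-separated 2-D Gaussian blobs."""
    rng = random.Random(seed)
    points, labels = [], []
    for label, (cx, cy) in enumerate([(0.0, 0.0), (10.0, 10.0)]):
        for _ in range(n_per_cluster):
            points.append((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)))
            labels.append(label)
    return points, labels


if __name__ == "__main__":
    pts, lbs = make_blobs(200, seed=1)
    print("full index:   ", silhouette(pts, lbs))
    print("sampled index:", sampled_silhouette(pts, lbs, 60, seed=2))
```

The sampled estimate needs pairwise distances only among the sampled points, so its cost grows with the square of the sample size rather than of the full data set. The difference between the two printed values is one realisation of the sampling error that the abstract refers to.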
Item Type: | Thesis (Masters) |
---|---|
Subjects: | Q Science > QA Mathematics > Electronic computers. Computer science; T Technology > T Technology (General) > Information Technology > Electronic computers. Computer science; Q Science > QA Mathematics > Computer software; T Technology > T Technology (General) > Information Technology > Computer software |
Divisions: | School of Computing > Master of Science in Data Analytics |
Depositing User: | Caoimhe Ní Mhaicín |
Date Deposited: | 03 Dec 2016 14:42 |
Last Modified: | 03 Dec 2016 14:42 |
URI: | https://norma.ncirl.ie/id/eprint/2502 |