Gilheany, Erin (2016) Processing time of TFIDF and Naive Bayes on Spark 2.0, Hadoop 2.6 and Hadoop 2.7: Which Tool Is More Efficient? Masters thesis, Dublin, National College of Ireland.
Abstract
There has been considerable emphasis on the performance differences between Hadoop and Spark. This research paper examines that comparison in detail, running the TF-IDF and Naive Bayes algorithms on both platforms to measure the differences in total processing time. The literature notes that Spark goes a long way towards addressing the limitations of Hadoop, in particular the issues that frequently arise when applying iterative machine learning algorithms, which are caused by slow input/output processing to disk. This paper explores the difference from a text categorisation standpoint. On a single computer there is a strong distinction between the computing times of TF-IDF on Spark versus Hadoop, with Spark completing the application in a fraction of the time. Naive Bayes shows a contrasting picture, with Spark's processing times on average twice as long as those of Hadoop. Given the additional cost of RAM for Spark, in this instance Hadoop would appear to be the better choice.
Item Type: | Thesis (Masters) |
---|---|
Subjects: | Q Science > QA Mathematics > Electronic computers. Computer science; T Technology > T Technology (General) > Information Technology > Electronic computers. Computer science; Q Science > QA Mathematics > Computer software; T Technology > T Technology (General) > Information Technology > Computer software |
Divisions: | School of Computing > Master of Science in Data Analytics |
Depositing User: | Caoimhe Ní Mhaicín |
Date Deposited: | 03 Dec 2016 11:54 |
Last Modified: | 03 Dec 2016 14:49 |
URI: | https://norma.ncirl.ie/id/eprint/2490 |