Verma, Smriti (2018) A Comparative Study of Oversampling Techniques on Binary Classification for Detecting Duplicate Advertisement. Masters thesis, Dublin, National College of Ireland.
Abstract
The online marketplace has become a major platform for conducting business. It not only allows users to find and buy desirable items easily, but also provides a space where users can list their refurbished products in search of a potential buyer. Because of ever-increasing competition within the market, sellers go to great lengths to ensure that their products are noticed. This results in sellers posting the same advertisement several times, using near-duplicate titles or slightly altered descriptions.
This study proposes to build a binary classifier that detects such duplicate commercial advertisements featuring the same product. A Russian dataset of 3 million records was translated into English to make the results easier to interpret. The dataset was imbalanced, with fewer samples in the duplicate class than in the non-duplicate class.
This study compares six oversampling techniques used to achieve class balance in the dataset: random oversampling, SMOTE, Borderline-SMOTE 1, Borderline-SMOTE 2, SVM SMOTE and ADASYN. Four classification models (Gradient Boosting Tree, Logistic Regression, Naive Bayes and SVM) are built on the oversampled data to identify the duplicate advertisements.
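To illustrate the difference between the first two techniques named above, the following is a minimal sketch, not the thesis implementation. Random oversampling copies existing minority samples, whereas SMOTE creates new points by interpolating between a minority sample and one of its nearest minority neighbours. The function names, the toy feature vectors and the choice of Euclidean distance are assumptions made for illustration only.

```python
import math
import random


def random_oversample(minority, n_new, seed=0):
    """Return n_new exact copies of randomly chosen minority samples."""
    rng = random.Random(seed)
    return [list(rng.choice(minority)) for _ in range(n_new)]


def smote_sketch(minority, n_new, k=5, seed=0):
    """Return n_new synthetic samples in the spirit of SMOTE.

    Each synthetic point lies on the line segment between a randomly
    chosen minority sample and one of its k nearest minority neighbours
    (Euclidean distance is an assumption of this sketch).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Candidate neighbours: every minority sample except the base itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy minority class: four points on the unit square.
toy_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
copies = random_oversample(toy_minority, n_new=6)
synthetic = smote_sketch(toy_minority, n_new=6, k=2)
```

The Borderline, SVM and ADASYN variants differ mainly in which minority samples are chosen as the base for interpolation, so they are not reproduced in this sketch.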
This study finds that classifier performance improves as the size of the training sample increases. The best-performing model was SVM paired with Borderline-SMOTE 2, with an F1 score of 0.9151.
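For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision and recall on the positive (here, duplicate) class. The sketch below shows the arithmetic only. The confusion-matrix counts are hypothetical and are not taken from the thesis.

```python
def f1_score(tp, fp, fn):
    """Compute F1 from true positives, false positives and false negatives."""
    if tp == 0:
        return 0.0  # no correct positive predictions, so F1 is zero
    precision = tp / (tp + fp)  # share of predicted duplicates that are real
    recall = tp / (tp + fn)     # share of real duplicates that were found
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, chosen only to illustrate the formula.
example_f1 = f1_score(tp=80, fp=10, fn=20)
```

Because F1 ignores true negatives, it is a common choice for imbalanced problems, where plain accuracy can be dominated by the majority class.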
The proposed model will spare buyers from sifting through dozens of deceptively identical advertisements, thereby expediting the search process. With more accurate duplicate ad detection, the model will enable buyers to find a desirable product easily.
Item Type: | Thesis (Masters) |
---|---|
Subjects: | Q Science > QA Mathematics > Electronic computers. Computer science; T Technology > T Technology (General) > Information Technology > Electronic computers. Computer science; Q Science > QA Mathematics > Computer software; T Technology > T Technology (General) > Information Technology > Computer software; H Social Sciences > HF Commerce > Electronic Commerce |
Divisions: | School of Computing > Master of Science in Data Analytics |
Depositing User: | Caoimhe Ní Mhaicín |
Date Deposited: | 05 Nov 2018 10:02 |
Last Modified: | 05 Nov 2018 10:02 |
URI: | https://norma.ncirl.ie/id/eprint/3421 |