NORMA eResearch @NCI Library

A Comparative Study of Oversampling Techniques on Binary Classification for Detecting Duplicate Advertisement

Verma, Smriti (2018) A Comparative Study of Oversampling Techniques on Binary Classification for Detecting Duplicate Advertisement. Masters thesis, Dublin, National College of Ireland.

The online marketplace has become a major platform for conducting business. It not only lets users find and buy desirable items easily, but also gives them a place to list refurbished products in search of a potential buyer. Because of ever-increasing competition within the market, sellers go to great lengths to ensure that their products are noticed. This results in sellers posting the same advertisement several times, with near-duplicate titles or slightly altered descriptions.

This study proposes a binary classifier that spots duplicate commercial advertisements featuring the same product. A Russian-language dataset of 3 million records was translated into English to make the results easier to interpret. The dataset was imbalanced: the duplicate class had fewer samples than the non-duplicate class.

This study compares six oversampling techniques used to achieve class balance in the dataset: Random oversampling, SMOTE, Borderline-SMOTE 1, Borderline-SMOTE 2, SVM SMOTE and ADASYN. Four classification models (Gradient Boosting Tree, Logistic Regression, Naive Bayes and SVM) are built on top of the oversampled data to identify the duplicate advertisements.
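As an illustration of the simplest two techniques named above, the following is a minimal sketch of random oversampling and basic SMOTE in plain Python. It is not the thesis's implementation: the function names, the toy two-dimensional data and the choice of k are assumptions made for this example, and the Borderline, SVM and ADASYN variants are not covered.

```python
import math
import random


def random_oversample(minority, n_new, rng):
    """Random oversampling: duplicate randomly chosen minority samples."""
    return [list(rng.choice(minority)) for _ in range(n_new)]


def smote(minority, n_new, k=5, rng=None):
    """Basic SMOTE: create n_new synthetic points, each interpolated between
    a minority sample and one of its k nearest minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = rng or random.Random(0)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: 3 "duplicate" (minority) samples
# versus 8 "non-duplicate" (majority) samples.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]]
majority = [[5.0 + i * 0.1, 5.0] for i in range(8)]

rng = random.Random(42)
n_needed = len(majority) - len(minority)
balanced_ros = minority + random_oversample(minority, n_needed, rng)
balanced_smote = minority + smote(minority, n_needed, k=2, rng=rng)
print(len(balanced_ros), len(balanced_smote))
```

Random oversampling only repeats existing minority rows, whereas SMOTE produces new points that lie on line segments between neighbouring minority samples. In both cases, oversampling would normally be applied to the training split only.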

This study finds that the performance of classifiers improves with an increase in the sample size of the training data. The best-performing model was SVM paired with Borderline-SMOTE 2, with an F1 score of 0.9151.
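For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision and recall. The sketch below computes it from confusion-matrix counts; the function name and the counts are hypothetical and are not taken from the thesis.

```python
def f1_score(tp, fp, fn):
    """F1 = 2 * precision * recall / (precision + recall)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for the "duplicate" class:
# 80 true positives, 10 false positives, 20 false negatives.
score = f1_score(tp=80, fp=10, fn=20)
print(round(score, 4))
```

Because F1 ignores true negatives, it is less inflated by a large majority class than plain accuracy, which is why it is a common choice for imbalanced problems.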

The proposed model will spare buyers from sifting through dozens of deceptively identical advertisements, thereby expediting the search process. With more accurate duplicate ad detection, buyers will be able to find a desirable product more easily.

Item Type: Thesis (Masters)
Subjects: Q Science > QA Mathematics > Electronic computers. Computer science
T Technology > T Technology (General) > Information Technology > Electronic computers. Computer science
Q Science > QA Mathematics > Computer software
T Technology > T Technology (General) > Information Technology > Computer software
H Social Sciences > HF Commerce > Electronic Commerce
Divisions: School of Computing > Master of Science in Data Analytics
Depositing User: Caoimhe Ní Mhaicín
Date Deposited: 05 Nov 2018 10:02
Last Modified: 05 Nov 2018 10:02