NORMA eResearch @NCI Library

Methodology for Automated Forensic Web Scraping of Pricing Information

Byrne, Peter (2023) Methodology for Automated Forensic Web Scraping of Pricing Information. Masters thesis, Dublin, National College of Ireland.

Files:
PDF (Master of Science), 1MB
PDF (Configuration manual), 316kB

Abstract

This research examines current web scraping methods and proposes a Python-based solution for automated forensic web scraping, with the main objective of scraping e-commerce pricing data. The methodology uses the Selenium and Beautiful Soup 4 libraries with MD5 hashing, automated through a bash script and ‘cron’ scheduling in a Linux environment. Six existing Python libraries are extensively tested and compared using a sample of websites across a local virtual machine and the Amazon Web Services (AWS), Microsoft Azure and Linode cloud platforms. The comparison experiment asks whether there are significant differences in efficacy, amount of data scraped and time taken across the different test environments and libraries. The final proposed methodology incorporates two stages, a downloader and a parser, to acquire, store and extract meaningful information from the website data. The methodology uses a supervised syntactic approach driven by a JSON configuration file.
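The two-stage idea in the abstract (store the raw page with an MD5 digest, then parse it using rules read from a JSON configuration file) can be illustrated with a minimal sketch. It is not the thesis's implementation: the configuration schema, the site name `example-shop`, the sample HTML and all function names are assumptions made for illustration. The standard-library `html.parser` stands in for Beautiful Soup 4 so the example runs without third-party packages, and a hard-coded string stands in for a page fetched with Selenium.

```python
import hashlib
import json
from html.parser import HTMLParser

# Hypothetical configuration: one rule per site naming the tag and class that
# hold a price. The real schema used in the thesis is not given in the abstract.
CONFIG_JSON = '{"example-shop": {"tag": "span", "class": "price"}}'

# Stand-in for a downloaded page (assumption: a real run would fetch this).
SAMPLE_HTML = (
    '<div><span class="price">€19.99</span>'
    '<span class="name">Kettle</span>'
    '<span class="price sale">€4.50</span></div>'
)


def md5_of(text: str) -> str:
    """Return the MD5 hex digest of the UTF-8 encoded text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def store_with_hash(html: str) -> dict:
    """Downloader stage (storage part): keep the raw page with its digest."""
    return {"html": html, "md5": md5_of(html)}


def verify(record: dict) -> bool:
    """Check that the stored page still matches the digest taken at download."""
    return md5_of(record["html"]) == record["md5"]


class PriceExtractor(HTMLParser):
    """Collect the text of every element matching the configured tag and class.

    Simplification: nested elements with the same tag name are not handled.
    """

    def __init__(self, tag: str, css_class: str):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.capturing = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.capturing = True
            self.prices.append("")

    def handle_data(self, data):
        if self.capturing:
            self.prices[-1] += data

    def handle_endtag(self, tag):
        if self.capturing and tag == self.tag:
            self.capturing = False
            self.prices[-1] = self.prices[-1].strip()


def parse_prices(record: dict, rule: dict) -> list:
    """Parser stage: refuse altered records, then apply the configured rule."""
    if not verify(record):
        raise ValueError("stored page does not match its MD5 digest")
    extractor = PriceExtractor(rule["tag"], rule["class"])
    extractor.feed(record["html"])
    extractor.close()
    return extractor.prices


if __name__ == "__main__":
    rule = json.loads(CONFIG_JSON)["example-shop"]
    record = store_with_hash(SAMPLE_HTML)
    print(record["md5"], parse_prices(record, rule))
```

In a setup closer to the one the abstract describes, the HTML string would presumably come from Selenium (for example the driver's `page_source`), and a bash script run by `cron` would invoke the downloader on a schedule. Keeping the digest next to the raw page lets the parser stage confirm that the stored copy has not changed since acquisition.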

Item Type: Thesis (Masters)
Supervisors: Ul Mustafa, Raza (Email: UNSPECIFIED)
Subjects: Q Science > QA Mathematics > Electronic computers. Computer science
T Technology > T Technology (General) > Information Technology > Electronic computers. Computer science
Q Science > QA Mathematics > Computer software > Computer Security
T Technology > T Technology (General) > Information Technology > Computer software > Computer Security
H Social Sciences > HF Commerce > Electronic Commerce
Divisions: School of Computing > Master of Science in Cyber Security
Depositing User: Tamara Malone
Date Deposited: 21 Oct 2024 16:55
Last Modified: 21 Oct 2024 16:55
URI: https://norma.ncirl.ie/id/eprint/7113
