Byrne, Peter (2023) Methodology for Automated Forensic Web Scraping of Pricing Information. Masters thesis, Dublin, National College of Ireland.
Abstract
This research examines current web scraping methods and proposes a Python-based solution for automated forensic web scraping, with the main objective of scraping e-commerce pricing data. The methodology uses the Selenium and Beautiful Soup 4 libraries with MD5 hashing, automated through a bash script and ‘cron’ scheduling in a Linux environment. Six existing Python libraries are extensively tested and compared against each other, using a sample of websites, across a local virtual machine and the Amazon Web Services (AWS), Microsoft Azure and Linode cloud platforms. The comparison experiment asks whether there are significant differences in efficacy, amount of data scraped and time taken across the different test environments and libraries. The final proposed methodology incorporates two stages, a downloader and a parser, to acquire, store and extract meaningful information from the website data. The methodology uses a supervised syntactic approach based on a JSON configuration file.
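The two-stage design described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation. The page is a hard-coded string standing in for HTML that Selenium would fetch, the standard-library `html.parser` is swapped in for Beautiful Soup 4 so the example has no dependencies, and the configuration keys (`tag`, `class`) and all function names are hypothetical.

```python
import hashlib
import json
from html.parser import HTMLParser

# Hypothetical configuration: the abstract does not give the real JSON schema.
CONFIG = json.loads('{"price": {"tag": "span", "class": "price"}}')


def store_page(html: str) -> dict:
    """Downloader stage (sketch): keep the raw page together with its MD5 digest."""
    digest = hashlib.md5(html.encode("utf-8")).hexdigest()
    return {"html": html, "md5": digest}


def verify_page(record: dict) -> bool:
    """Re-hash the stored page and compare it with the recorded digest."""
    current = hashlib.md5(record["html"].encode("utf-8")).hexdigest()
    return current == record["md5"]


class FieldExtractor(HTMLParser):
    """Collect the text of elements matching a tag name and class.

    Simplification: nested elements with the same tag name are not handled.
    """

    def __init__(self, tag: str, css_class: str):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.capturing = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.capturing = True
            self.values.append("")

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.capturing = False

    def handle_data(self, data):
        if self.capturing:
            self.values[-1] += data


def parse_page(record: dict, config: dict) -> dict:
    """Parser stage (sketch): check integrity, then extract each configured field."""
    if not verify_page(record):
        raise ValueError("stored page does not match its MD5 digest")
    result = {}
    for field, rule in config.items():
        extractor = FieldExtractor(rule["tag"], rule["class"])
        extractor.feed(record["html"])
        result[field] = [value.strip() for value in extractor.values]
    return result


# Stand-in for a downloaded product page.
SAMPLE_HTML = '<div><span class="price">€19.99</span><span class="name">Widget</span></div>'

record = store_page(SAMPLE_HTML)
prices = parse_page(record, CONFIG)
print(prices)
```

Keeping the digest alongside the raw page lets the parser confirm that the stored copy is unchanged before extracting anything from it, which is the kind of integrity check the MD5 step suggests. Separating download from parsing also means a stored page can be re-parsed later with a revised configuration without fetching it again.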
Item Type: | Thesis (Masters) |
---|---|
Supervisors: | Ul Mustafa, Raza |
Subjects: | Q Science > QA Mathematics > Electronic computers. Computer science; T Technology > T Technology (General) > Information Technology > Electronic computers. Computer science; Q Science > QA Mathematics > Computer software > Computer Security; T Technology > T Technology (General) > Information Technology > Computer software > Computer Security; H Social Sciences > HF Commerce > Electronic Commerce |
Divisions: | School of Computing > Master of Science in Cyber Security |
Depositing User: | Tamara Malone |
Date Deposited: | 21 Oct 2024 16:55 |
Last Modified: | 21 Oct 2024 16:55 |
URI: | https://norma.ncirl.ie/id/eprint/7113 |