da Silva de Oliveira, Priscila Cristina (2024) From LDA to BERTopic: Evaluating Topic Modelling Methods for Aviation Safety Reports in Brazilian Portuguese. Masters thesis, Dublin, National College of Ireland.
Abstract
This study applies five different topic models to aviation safety reports in Brazilian Portuguese. The techniques explored are Latent Dirichlet Allocation (LDA), LDA with stemming, a cross-language model that translates the texts to English and then performs LDA, Word2Vec with k-means, and BERTopic. The research aims to explore a dataset that has not previously been used in published research and to evaluate how effectively each approach identifies topics within the corpus of reports. BERTopic outperformed the other models, achieving a coherence score of 0.4819. A composite score, calculated from the coherence and perplexity scores, was used to evaluate the LDA models; LDA with stemming achieved the best composite score. Furthermore, Word2Vec with k-means might be a better approach for more generalised classifications.
Item Type: | Thesis (Masters) |
---|---|
Supervisors: | Haycock, Barry |
Subjects: | Q Science > QA Mathematics > Electronic computers. Computer science; T Technology > T Technology (General) > Information Technology > Electronic computers. Computer science; H Social Sciences > HD Industries. Land use. Labor > Specific Industries > Aviation Industry; P Language and Literature > P Philology. Linguistics > Computational linguistics. Natural language processing; Q Science > Q Science (General) > Self-organizing systems. Conscious automata > Machine learning |
Divisions: | School of Computing > Master of Science in Data Analytics |
Depositing User: | Ciara O'Brien |
Date Deposited: | 15 Aug 2025 17:10 |
Last Modified: | 15 Aug 2025 17:10 |
URI: | https://norma.ncirl.ie/id/eprint/8549 |