Prakash, Likhitha Konasale (2025) Enhancing Climate Change Stance Detection Through Advanced Synthetic Data Augmentation. Masters thesis, Dublin, National College of Ireland.
Abstract
Climate change stance detection automatically classifies social media posts into viewpoint groups on climate change, typically separating those who accept climate science, those who reject it, and those who are neutral. This study proposes a synthetic data augmentation system to improve the accuracy of social media stance detection, especially for under-represented minority opinions. The main goal is to address the severe class imbalance in climate debate data, where minority opinions often make up less than 12% of the training data and models consequently fail to detect them. For example, only 11.51% of the samples in the Twitter Climate Change Sentiment Dataset express a stance against climate change. This work shows that synthetic data generation can be used to balance such training datasets.
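The balancing arithmetic behind this goal can be illustrated with a short sketch. It is not taken from the thesis: the function name `synthetic_needed`, the dataset size of 10,000, and the 25% target share are all hypothetical values chosen for illustration. It computes how many synthetic minority samples would have to be added so that the minority class reaches a chosen share of the enlarged dataset.

```python
import math


def synthetic_needed(total: int, minority: int, target_share: float) -> int:
    """Smallest k such that (minority + k) / (total + k) >= target_share.

    Hypothetical helper for illustration; not part of the thesis code.
    """
    if not 0 < target_share < 1:
        raise ValueError("target_share must be strictly between 0 and 1")
    if not 0 <= minority <= total:
        raise ValueError("minority must be between 0 and total")
    # Solve (minority + k) / (total + k) = target_share for k.
    k = (target_share * total - minority) / (1 - target_share)
    return max(0, math.ceil(k))


# Assumed example: 10,000 samples, 11.51% minority, 25% target share.
needed = synthetic_needed(total=10_000, minority=1_151, target_share=0.25)
print(needed)  # 1799
```

Because every added sample also enlarges the dataset, the required number is larger than the simple difference between the target count and the current count.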
The thesis proposes a general augmentation framework built on OpenAI's GPT-4.1 Mini. It makes three main contributions: stance-adapted generation strategies based on linguistic analysis of climate discourse, a parallel processing architecture that produces more than 60 samples per minute, and a five-layer validation system that ensures the quality of the synthetic data. Ten specific strategies were developed through linguistic analysis to generate realistic samples for the under-represented anti and neutral stances. Validation tests on seven models showed substantial improvements. The best model, RoBERTa, reached 88.92% accuracy and improved the identification of minority classes by 47%. The system produced 20,000 high-quality synthetic instances from 41,000 generation attempts, raising the dataset's anti-stance representation from 11.51% to 25.59%.
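The general shape of a parallel generate-then-validate pipeline can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: `stub_generate` stands in for a real model call, and the two checks plus the duplicate filter are placeholder validators, since the abstract does not describe the five validation layers.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def stub_generate(prompt: str) -> str:
    # Hypothetical stand-in for a language model call.
    return f"Synthetic reply to: {prompt}"


def long_enough(text: str) -> bool:
    # Placeholder check: reject very short outputs.
    return len(text.split()) >= 4


def fits_tweet(text: str) -> bool:
    # Placeholder check: assume a 280-character limit.
    return len(text) <= 280


VALIDATORS: tuple[Callable[[str], bool], ...] = (long_enough, fits_tweet)


def generate_validated(
    prompts: Sequence[str],
    generate: Callable[[str], str] = stub_generate,
    validators: Sequence[Callable[[str], bool]] = VALIDATORS,
    workers: int = 8,
) -> tuple[list[str], float]:
    """Generate candidates in parallel, then keep unique ones that pass all checks."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        candidates = list(pool.map(generate, prompts))

    seen: set[str] = set()
    accepted: list[str] = []
    for text in candidates:
        key = text.strip().lower()
        if key in seen:
            continue  # drop exact duplicates
        if all(check(text) for check in validators):
            seen.add(key)
            accepted.append(text)

    rate = len(accepted) / len(candidates) if candidates else 0.0
    return accepted, rate


samples, acceptance = generate_validated(["a b c", "a b c", "x"])
print(len(samples), round(acceptance, 2))
```

For comparison, the figures reported above (20,000 accepted instances from 41,000 attempts) correspond to an acceptance rate of roughly 48.8%.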
These experiments provide a practical solution to class imbalance in stance detection and a theoretical advance in synthetic data generation for ideologically sensitive tasks. The approach can be generalized to other types of polarized discourse where identifying minority perspectives remains essential to understanding public opinion dynamics.