NORMA eResearch @NCI Library

Multilingual Toxicity Detection with Enhanced Balancing and Contextual Learning

Saju, Jacob (2025) Multilingual Toxicity Detection with Enhanced Balancing and Contextual Learning. Masters thesis, Dublin, National College of Ireland.

Abstract

Online toxicity detection systems struggle to scale across diverse linguistic and cultural environments, frequently privileging high-resource languages and offering poor protection to speakers of low-resource languages. This work presents MultiToxiGuard, a multilingual toxicity detection system that addresses these problems with three new components: a Smart Balancing Module that uses hierarchical sampling and dynamic weighting, a Contextual Enhancement Layer that leverages cultural embeddings for improved semantic awareness, and a Confidence Estimation System that provides robust uncertainty estimation. On a dataset of 15 languages from 9 language families, rigorous data augmentation greatly increased the representation of low-resource languages (Japanese +1518%, Vietnamese +1208%). Validation results indicate high overall performance (F1 = 0.7944, accuracy = 0.8278) with strikingly uniform results across languages and a cultural fairness score of 0.96. In particular, some low-resource languages (Estonian, Swahili) performed better than medium-resource languages, highlighting the efficacy of the balancing techniques. Although the performance targets for F1 (≥0.88) and false-positive rate (≤0.03) remain challenging, MultiToxiGuard is a major step toward fair content moderation that closes the performance gap between high- and low-resource languages, a persistent weakness of earlier techniques. The system offers a single, integrated framework for toxicity detection that performs consistently without requiring a separate model per language, a marked improvement over existing multilingual content moderation technologies.

Item Type: Thesis (Masters)
Supervisors:
Name: Niculescu, Hamilton
Email: UNSPECIFIED
Subjects: Q Science > QA Mathematics > Electronic computers. Computer science
T Technology > T Technology (General) > Information Technology > Electronic computers. Computer science
P Language and Literature > P Philology. Linguistics > Computational linguistics. Natural language processing
Z Bibliography. Library Science. Information Resources > ZA Information resources > ZA4150 Computer Network Resources > The Internet > World Wide Web > Websites > Online social networks
T Technology > TK Electrical engineering. Electronics. Nuclear engineering > Telecommunications > The Internet > World Wide Web > Websites > Online social networks
Divisions: School of Computing > Master of Science in Data Analytics
Depositing User: Ciara O'Brien
Date Deposited: 18 Nov 2025 17:47
Last Modified: 18 Nov 2025 17:47
URI: https://norma.ncirl.ie/id/eprint/8945
