Audio Deepfake Detection With hybrid CNN and ViT

Daniel, Jennifer

Daniel, Jennifer (2025) Audio Deepfake Detection With hybrid CNN and ViT. Masters thesis, Dublin, National College of Ireland.

Abstract

Audio deepfakes, generated by automated speech synthesis and voice conversion software, pose a growing threat to digital safety, privacy and the integrity of various media. Audio deepfaking is less studied than visual deepfaking, despite the many challenges it poses in varied noisy environments. This study addresses this gap by examining the efficacy of four architectures for detecting synthetic audio: Convolutional Neural Networks (CNN), Convolutional Recurrent Neural Networks (CRNN), Mobile Vision Transformer (MobileViT), and the Patchout Spectrogram Transformer (PaSST).

The models were trained and evaluated on the FakeAVCeleb dataset, supplemented with additional samples of hospital noise, and were tested on both clean and noisy audio. To improve robustness, RandAugment was applied to the CNN and CRNN; it generates spectrogram distortions that increase sample diversity during training. All models achieved high accuracy and F1-scores above 0.97 on clean audio, but performance degraded markedly on noisy inputs: the CNN and CRNN fell below 0.56 F1, while MobileViT and PaSST saw smaller drops, with F1-scores above 0.60 and above 0.55 respectively.
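As a rough illustration of what a noisy test condition involves, the sketch below mixes a background noise clip into a clean signal at a chosen signal-to-noise ratio. The abstract does not say how the hospital noise was combined with the speech or at what level, so the function name `mix_at_snr`, the 5 dB setting, and the synthetic stand-in signals are all assumptions made for illustration, not the thesis's procedure. Plain Python lists are used to keep the example dependency-free; real pipelines would normally use array libraries.

```python
import math
import random


def mean_power(signal):
    """Average power (mean of squared samples) of a sequence of floats."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean`, scaled so the mix has the requested SNR in dB.

    Hypothetical helper: the mixing scheme is an assumption, not taken
    from the thesis.
    """
    clean = list(clean)
    noise = list(noise)
    if not clean or not noise:
        raise ValueError("clean and noise must be non-empty")

    # Repeat the noise clip if it is shorter than the speech, then trim it.
    reps = math.ceil(len(clean) / len(noise))
    noise = (noise * reps)[: len(clean)]

    noise_power = mean_power(noise)
    if noise_power == 0:
        raise ValueError("noise clip is silent")

    # SNR_dB = 10 * log10(P_clean / P_noise), solved for the noise power.
    target_noise_power = mean_power(clean) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    return [c + scale * n for c, n in zip(clean, noise)]


# Synthetic stand-ins (assumed): a 220 Hz tone for speech, Gaussian noise
# for the hospital recording.
sample_rate = 16000
rng = random.Random(0)
speech = [0.5 * math.sin(2 * math.pi * 220 * t / sample_rate)
          for t in range(sample_rate)]
hospital_noise = [rng.gauss(0.0, 0.1) for _ in range(sample_rate // 2)]

noisy = mix_at_snr(speech, hospital_noise, snr_db=5.0)
```

Lowering `snr_db` makes the noise louder relative to the speech, which is one simple way to probe how quickly a detector's F1-score falls off as conditions worsen.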

The findings highlight the sensitivity of modern detection systems to background noise, as well as the merits of transformer-based architectures in real-world conditions. Through a systematic comparison of these models, the study also underlines the value of data augmentation and hybrid architectures for building robust and practical audio deepfake detection systems.

Item Type:	Thesis (Masters)
Supervisors:	Menghwar, Teerath Kumar (email unspecified)
Uncontrolled Keywords:	Audio deepfakes; deep learning; CNN; CRNN; MobileViT; PaSST; RandAugment; noisy environments
Subjects:	Q Science > QA Mathematics > Computer software; T Technology > T Technology (General) > Information Technology > Computer software; Q Science > QH Natural history > QH301 Biology > Methods of research. Technique. Experimental biology > Data processing. Bioinformatics > Artificial intelligence; Q Science > Q Science (General) > Self-organizing systems. Conscious automata > Artificial intelligence; Q Science > Q Science (General) > Self-organizing systems. Conscious automata > Machine learning
Divisions:	School of Computing > Master of Science in Data Analytics
Depositing User:	Ciara O'Brien
Date Deposited:	01 Jul 2026 08:55
Last Modified:	01 Jul 2026 08:55
URI:	https://norma.ncirl.ie/id/eprint/9423