Anand, Divyansh (2024) A Deep Neural Network Approach Integrating CNN and BiLSTM-Transformer Architectures for Emotion Recognition from Speech. Masters thesis, Dublin, National College of Ireland.
Available files:

- PDF (Master of Science), 2MB
- PDF (Configuration Manual), 1MB
Abstract
The goal of speech emotion recognition is to make human-computer interaction more effective in areas such as customer service, entertainment, healthcare, and education. Previous work in speech emotion analysis suffered from issues such as a limited choice of features, model complexity, noise variability, and insufficient data samples, all of which negatively affected the prediction of emotions. This paper provides an in-depth study of speech emotion recognition using a hybrid deep neural network architecture that combines a 1-D Convolutional Neural Network (CNN) with a BiLSTM-Transformer model, evaluated on the Ravdess and Crema-D datasets. To make the datasets suitable for emotion detection, all audio was preprocessed with the librosa library to remove non-speech segments. Key acoustic features, namely Mel-Frequency Cepstral Coefficients (MFCC), Root Mean Square Energy (RMSE), and Zero Crossing Rate (ZCR), were extracted to capture the spectral, intensity, and dynamic characteristics of emotional speech. To improve the model's generalization and robustness, noise injection, time stretching, time shifting, and pitch shifting were applied during data augmentation. The proposed model leverages the strengths of both components: the 1-D CNN captures local patterns in the audio, while the BiLSTM-Transformer handles sequential dependencies and the complex hierarchical structure of speech. The model was trained and tested on both Ravdess and Crema-D for the emotion classification task, and performance was evaluated using training-validation accuracy curves, confusion matrices, and overall metrics including precision, recall, and F1-score. On Ravdess the model achieved a high accuracy of 83.3%, with surprise, angry, disgust, and sad among the emotions it identified most reliably; on Crema-D it achieved 82.7% accuracy, with solid performance in detecting neutral, fear, and happy emotions. The training-validation accuracy plots demonstrated good generalization to unseen data, and the confusion matrices highlighted the emotion categories where improvement could be made.
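The abstract names librosa for preprocessing and the MFCC/RMSE/ZCR feature set but not the exact parameters. The following is a minimal sketch of that extraction step; the sample rate, `n_mfcc`, and trim threshold are illustrative assumptions, not values reported in the thesis.

```python
import librosa
import numpy as np

def extract_features(path, n_mfcc=40):
    """Load a clip, trim non-speech silence, and return one feature vector."""
    y, sr = librosa.load(path, sr=22050)
    y, _ = librosa.effects.trim(y, top_db=25)  # drop leading/trailing silence

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # spectral envelope
    rmse = librosa.feature.rms(y=y)                          # frame-wise energy (intensity)
    zcr = librosa.feature.zero_crossing_rate(y)              # signal dynamics

    # Average each feature over time to obtain a fixed-length vector per clip.
    return np.hstack([mfcc.mean(axis=1), rmse.mean(axis=1), zcr.mean(axis=1)])
```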
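The four augmentations listed in the abstract (noise injection, time stretching, time shifting, pitch shifting) can each be expressed in a few lines of librosa/NumPy. The noise amplitude, stretch rate, shift offset, and semitone step below are illustrative assumptions.

```python
import librosa
import numpy as np

def augment(y, sr):
    """Return four augmented copies of a waveform, one per technique."""
    noisy = y + 0.005 * np.random.randn(len(y))                  # noise injection
    stretched = librosa.effects.time_stretch(y, rate=0.9)       # time stretching (slower)
    shifted = np.roll(y, int(0.1 * sr))                          # time shifting by ~100 ms
    pitched = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)  # pitch shifting up 2 semitones
    return [noisy, stretched, shifted, pitched]
```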
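To make the hybrid architecture concrete, here is a minimal Keras sketch of a 1-D CNN feeding a BiLSTM followed by a Transformer-style self-attention encoder block, in the arrangement the abstract describes. Filter counts, LSTM units, and attention heads are illustrative assumptions, not the thesis's reported configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(timesteps, n_features, n_classes):
    inputs = layers.Input(shape=(timesteps, n_features))

    # 1-D CNN: local patterns in the frame-level features.
    x = layers.Conv1D(64, 5, padding="same", activation="relu")(inputs)
    x = layers.MaxPooling1D(2)(x)
    x = layers.Conv1D(128, 5, padding="same", activation="relu")(x)
    x = layers.MaxPooling1D(2)(x)

    # BiLSTM: sequential dependencies in both directions.
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)

    # Transformer encoder block: self-attention plus feed-forward, with
    # residual connections and layer normalization.
    attn = layers.MultiHeadAttention(num_heads=4, key_dim=32)(x, x)
    x = layers.LayerNormalization()(x + attn)
    ff = layers.Dense(128, activation="relu")(x)
    ff = layers.Dense(x.shape[-1])(ff)
    x = layers.LayerNormalization()(x + ff)

    x = layers.GlobalAveragePooling1D()(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)

    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

The pooled CNN output shortens the sequence before the recurrent and attention layers, which keeps the BiLSTM and self-attention tractable on clip-length inputs.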
| Item Type: | Thesis (Masters) |
|---|---|
| Supervisors: | Qayum, Abdul |
| Uncontrolled Keywords: | Speech Emotion Detection; Convolutional Neural Network; Bidirectional Long Short-Term Memory; Transformer; Ravdess; Crema-D |
| Subjects: | Q Science > QA Mathematics > Electronic computers. Computer science; T Technology > T Technology (General) > Information Technology > Electronic computers. Computer science; B Philosophy. Psychology. Religion > Psychology > Emotions; Q Science > Q Science (General) > Self-organizing systems. Conscious automata > Machine learning |
| Divisions: | School of Computing > Master of Science in Data Analytics |
| Depositing User: | Ciara O'Brien |
| Date Deposited: | 07 Aug 2025 08:27 |
| Last Modified: | 07 Aug 2025 08:27 |
| URI: | https://norma.ncirl.ie/id/eprint/8453 |