Keywords:
Accuracy
Deep learning
Datasets
Voice recognition
Classification
Acoustics
Sound
Robotics
Computer science
Abstract:
Although research on Acoustic Scene Classification (ASC) is closely related to, and often overshadowed by, more popular research areas such as Automatic Speech Recognition (ASR), Speaker Recognition (SR), and Image Processing (IP), the field opens up several distinct and meaningful application areas based on environmental context detection. The challenges of ASC come mainly from the variety of noise sources and sounds in real-world environments, which may occur as single sounds, continuous sounds, or overlapping sounds. Compared with speech, sound scenes are more challenging, mainly because they are unstructured in form and, in certain contexts, closely resemble noise. Although a wide range of recent publications have focused on ASC, they tend to take task-specific approaches that either explore only certain aspects of an ASC system or are evaluated on limited acoustic scene datasets. The aim of this thesis is therefore to contribute to the development of a robust framework for ASC that is evaluated on various recently published datasets and achieves competitive performance compared with state-of-the-art systems. To this end, a baseline model is first introduced. Next, extensive experiments on the baseline are conducted to identify the key factors affecting final classification accuracy. Based on this comprehensive analysis, a robust deep learning framework, namely the Encoder-Decoder structure, is proposed to address three main factors that directly affect an ASC system: low-level input features, high-level feature extraction methodologies, and architectures for final classification. Within the proposed framework, three spectrogram transformations, namely Constant Q Transform (CQT), gammatone filter (Gamma), and log-mel, are used to convert recorded audio signals into spectrogram representations that resemble two-dimensional images. These three spectrograms are referred to as low-level input features. To extract high-level features from spectr