Paper, international academic journal (SCI-level): Self-Attention-Based Masked Spectrogram Generation and Self-Supervised Learning Method for Improving Speech Emotion Recognition

Abstract

In this paper, we propose the Self-Attention-based Masked Spectrogram Generation (SAMSG) method to address the problem of model overfitting and improve generalization performance in speech emotion recognition under limited data conditions. A key challenge in many emotional speech datasets is that a small set of fixed sentences is repeatedly uttered with different emotional expressions, which can cause models to overfit to sentence-specific acoustic patterns rather than learn generalizable emotion-related features. To overcome this limitation, the proposed SAMSG method utilizes a pure self-attention-based model (DeiT) to obtain attention maps and applies the attention rollout technique to extract regions of high importance from time-frequency spectrograms. It then selectively masks only the regions that are important for emotion recognition, encouraging the model to learn complementary emotional information from less attended areas. This approach addresses the learning bias commonly seen in self-attention models, which tend to over-focus on localized regions of the input. The originality of the SAMSG method lies in its use of self-attention-driven masking, which—unlike conventional random masking—removes regions the model itself considers important, thereby promoting the learning of more diverse and robust emotional features. Our method alleviates overfitting without requiring external data or large-scale datasets, and achieves strong generalization even in data-constrained environments. Experiments conducted on the SAVEE, EmoDB, and CREMA-D datasets show that the proposed SAMSG method outperforms existing self-attention-based models, achieving accuracies of 94.44%, 96.30%, and 85.94%, respectively. It also attains macro-averaged F1-scores of 0.9401, 0.9692, and 0.8595, demonstrating consistent robustness across diverse emotional speech corpora.
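The masking step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes per-layer attention matrices are already available, averages attention heads, adds the identity for residual connections as in the standard attention rollout formulation, and treats the first two tokens as the class and distillation tokens of a distilled DeiT. The function names, the `ratio` of patches to mask, the patch size, and the zero fill value are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import numpy as np


def attention_rollout(attentions):
    """Combine per-layer attention into one token-to-token relevance matrix.

    attentions: list of arrays shaped (heads, tokens, tokens), one per layer,
    where each row sums to 1 (softmax output).
    """
    n = attentions[0].shape[-1]
    rollout = np.eye(n)
    for layer_attn in attentions:
        a = layer_attn.mean(axis=0)            # assumption: average over heads
        a = 0.5 * a + 0.5 * np.eye(n)          # account for the residual path
        a = a / a.sum(axis=-1, keepdims=True)  # keep rows normalized
        rollout = a @ rollout                  # propagate through this layer
    return rollout


def mask_important_patches(spec, rollout, patch, ratio,
                           num_prefix_tokens=2, fill=0.0):
    """Mask the patches that the class token attends to most.

    spec: 2-D spectrogram (frequency x time) whose sides are multiples of
    `patch`. num_prefix_tokens=2 assumes class + distillation tokens.
    """
    scores = rollout[0, num_prefix_tokens:]    # class-token row, patches only
    rows, cols = spec.shape[0] // patch, spec.shape[1] // patch
    if scores.size != rows * cols:
        raise ValueError("token count does not match the patch grid")

    k = int(round(ratio * scores.size))        # hypothetical masking budget
    top = np.argsort(scores)[::-1][:k]         # indices of highest-scoring patches

    masked = spec.copy()
    for idx in top:
        r, c = divmod(int(idx), cols)
        masked[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = fill
    return masked, top
```

In contrast to random masking, the patches removed here are the ones with the highest rollout scores, so a model trained on the masked spectrogram has to rely on the less attended regions. A call such as `mask_important_patches(spec, attention_rollout(attn_list), patch=16, ratio=0.3)` would return the masked spectrogram together with the indices of the removed patches; the values 16 and 0.3 are placeholders, not settings reported in the paper.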