Paper (International Journal, SCI-level): Self-Attention-Based Masked Spectrogram Generation and Self-Supervised Learning Method for Improving Speech Emotion Recognition
- Journal type: International journal (SCI-level)
- Publication date: 2025-08
- Authors: Jeong-Yoon Kim, Seung-Ho Lee
- Journal: IEEE Access
- Country of publication: Overseas
- Paper language: Foreign language (English)
- Total number of authors: 2
- Paper download link (external): https://doi.org/10.1109/ACCESS.2025.3599218
- Research field: Engineering > Electronics/Information and Communication Engineering
- Keywords: #vision transformer #self-supervised learning #masked spectrogram generation #self-attention-based #speech emotion recognition
Abstract
In this paper, we propose the Self-Attention-based Masked Spectrogram Generation (SAMSG) method to address the problem of model overfitting and improve generalization performance in speech emotion recognition under limited data conditions. A key challenge in many emotional speech datasets is that a small set of fixed sentences is repeatedly uttered with different emotional expressions, which can cause models to overfit to sentence-specific acoustic patterns rather than learn generalizable emotion-related features. To overcome this limitation, the proposed SAMSG method utilizes a pure self-attention-based model (DeiT) to obtain attention maps and applies the attention rollout technique to extract regions of high importance from time-frequency spectrograms. It then selectively masks only the regions that are important for emotion recognition, encouraging the model to learn complementary emotional information from less attended areas. This approach addresses the learning bias commonly seen in self-attention models, which tend to over-focus on localized regions of the input. The originality of the SAMSG method lies in its use of self-attention-driven masking, which—unlike conventional random masking—removes regions the model itself considers important, thereby promoting the learning of more diverse and robust emotional features. Our method alleviates overfitting without requiring external data or large-scale datasets, and achieves strong generalization even in data-constrained environments. Experiments conducted on the SAVEE, EmoDB, and CREMA-D datasets show that the proposed SAMSG method outperforms existing self-attention-based models, achieving accuracies of 94.44%, 96.30%, and 85.94%, respectively. It also attains macro-averaged F1-scores of 0.9401, 0.9692, and 0.8595, demonstrating consistent robustness across diverse emotional speech corpora.
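To make the masking step described in the abstract more concrete, below is a minimal PyTorch-style sketch, not the authors' released implementation. It assumes that per-layer self-attention tensors have already been collected from a DeiT-style encoder, fuses them with attention rollout, and then masks the most attended spectrogram patches so that training proceeds on the less attended regions. The function names (attention_rollout, mask_top_attended_patches) and the hyperparameters (residual_alpha, patch_size, mask_ratio, fill_value) are illustrative assumptions, not values taken from the paper.

```python
import torch

def attention_rollout(attentions, residual_alpha=0.5):
    """Fuse per-layer self-attention maps into one importance map
    (attention rollout in the style of Abnar & Zuidema).

    attentions: list of tensors of shape (batch, heads, tokens, tokens),
                e.g. collected with forward hooks on a DeiT-style model.
    Returns: (batch, num_patches) importance of each patch token as seen
             from the class token.
    """
    result = None
    for attn in attentions:
        attn = attn.mean(dim=1)                                   # average over heads -> (B, T, T)
        eye = torch.eye(attn.size(-1), device=attn.device)
        attn = residual_alpha * attn + (1 - residual_alpha) * eye  # account for residual connections
        attn = attn / attn.sum(dim=-1, keepdim=True)               # re-normalize rows
        result = attn if result is None else attn @ result         # propagate through layers
    # Row 0 is the class token's attention; drop the class token column itself.
    # (DeiT also carries a distillation token; adjust the slice if it is present.)
    return result[:, 0, 1:]

def mask_top_attended_patches(spectrogram, patch_importance,
                              patch_size=16, mask_ratio=0.3, fill_value=0.0):
    """Mask the spectrogram patches the model attends to most, forcing it to
    learn complementary emotion cues from the remaining regions.

    spectrogram: (B, 1, H, W) log-mel spectrogram image.
    patch_importance: (B, num_patches) scores from attention_rollout().
    """
    b, _, h, w = spectrogram.shape
    gh, gw = h // patch_size, w // patch_size
    num_mask = int(mask_ratio * gh * gw)

    masked = spectrogram.clone()
    top_idx = patch_importance.topk(num_mask, dim=-1).indices      # (B, num_mask)
    for i in range(b):
        for idx in top_idx[i]:
            r, c = divmod(idx.item(), gw)                           # patch grid coordinates
            masked[i, :, r * patch_size:(r + 1) * patch_size,
                         c * patch_size:(c + 1) * patch_size] = fill_value
    return masked
```

The design choice illustrated here is the inverse of MAE-style random masking: instead of hiding arbitrary patches, the sketch removes exactly the regions the model itself currently relies on, which is what the abstract credits with reducing the over-focus of self-attention on localized spectrogram regions.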