Hanbat National University Microsite

Paper (overseas, SCI-level international journal): Self-Attention-Based Masked Spectrogram Generation and Self-Supervised Learning Method for Improving Speech Emotion Recognition

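Journal category: International journal (SCI-level)
Publication date: 2025-08
Authors: Jeong-Yoon Kim, Seung-Ho Lee
Journal: IEEE Access
Country of publication: Overseas
Paper language: Foreign language
Total number of authors: 2
Paper download link (external): https://doi.org/10.1109/ACCESS.2025.3599218
Research field: Engineering > Electronics / Information and Communication Engineering
Keywords: #vision transformer #self-supervised learning #masked spectrogram generation #self-attention-based #Speech emotion recognition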

Self-Attention-Based_Masked_Spectrogram_Generation_and_Self-Supervised_Learning.pdf (1.82 MB)

Abstract

In this paper, we propose the Self-Attention-based Masked Spectrogram Generation (SAMSG) method to address the problem of model overfitting and improve generalization performance in speech emotion recognition under limited data conditions. A key challenge in many emotional speech datasets is that a small set of fixed sentences is repeatedly uttered with different emotional expressions, which can cause models to overfit to sentence-specific acoustic patterns rather than learn generalizable emotion-related features. To overcome this limitation, the proposed SAMSG method utilizes a pure self-attention-based model (DeiT) to obtain attention maps and applies the attention rollout technique to extract regions of high importance from time-frequency spectrograms. It then selectively masks only the regions that are important for emotion recognition, encouraging the model to learn complementary emotional information from less attended areas. This approach addresses the learning bias commonly seen in self-attention models, which tend to over-focus on localized regions of the input. The originality of the SAMSG method lies in its use of self-attention-driven masking, which—unlike conventional random masking—removes regions the model itself considers important, thereby promoting the learning of more diverse and robust emotional features. Our method alleviates overfitting without requiring external data or large-scale datasets, and achieves strong generalization even in data-constrained environments. Experiments conducted on the SAVEE, EmoDB, and CREMA-D datasets show that the proposed SAMSG method outperforms existing self-attention-based models, achieving accuracies of 94.44%, 96.30%, and 85.94%, respectively. It also attains macro-averaged F1-scores of 0.9401, 0.9692, and 0.8595, demonstrating consistent robustness across diverse emotional speech corpora.
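The masking idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not state the patch size, the fraction of patches masked, the fill value, or how DeiT's special tokens are handled, so `patch_size`, `mask_ratio`, `fill_value`, and the single class token at index 0 are all assumptions chosen for illustration, and the function names are hypothetical. Attention rollout follows the commonly used formulation: average the attention over heads, add the identity to account for residual connections, renormalize the rows, and multiply the result across layers.

```python
import numpy as np


def attention_rollout(attentions):
    """Combine per-layer attention maps into one token-to-token relevance matrix.

    attentions: list of arrays shaped (heads, tokens, tokens), one per layer,
    where each row is a softmax distribution over tokens.
    """
    num_tokens = attentions[0].shape[-1]
    rollout = np.eye(num_tokens)
    for layer_attn in attentions:
        fused = layer_attn.mean(axis=0)              # average over heads
        fused = fused + np.eye(num_tokens)           # model the residual connection
        fused = fused / fused.sum(axis=-1, keepdims=True)  # rows sum to 1 again
        rollout = fused @ rollout                    # propagate through this layer
    return rollout


def mask_top_patches(spectrogram, patch_scores, patch_size,
                     mask_ratio=0.25, fill_value=0.0):
    """Overwrite the highest-scoring patches of a 2-D spectrogram.

    mask_ratio and fill_value are illustrative assumptions, not values
    taken from the paper.
    """
    rows = spectrogram.shape[0] // patch_size
    cols = spectrogram.shape[1] // patch_size
    assert patch_scores.size == rows * cols, "one score per patch is required"
    num_masked = int(round(mask_ratio * patch_scores.size))
    # Indices of the most attended patches, highest score first.
    top = np.argsort(patch_scores.ravel())[::-1][:num_masked]
    masked = spectrogram.copy()
    for idx in top:
        r, c = divmod(int(idx), cols)
        masked[r * patch_size:(r + 1) * patch_size,
               c * patch_size:(c + 1) * patch_size] = fill_value
    return masked, top


# Toy example: random stand-in attention maps instead of a real DeiT model.
rng = np.random.default_rng(0)
num_layers, num_heads, patch_size = 4, 3, 8
spectrogram = rng.random((32, 32)) + 1.0           # strictly positive toy input
num_patches = (32 // patch_size) ** 2              # 16 patches
num_tokens = num_patches + 1                       # assumed: one class token at index 0

logits = rng.normal(size=(num_layers, num_heads, num_tokens, num_tokens))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

rollout = attention_rollout(list(attn))
patch_scores = rollout[0, 1:]                      # class-token relevance per patch
masked, masked_ids = mask_top_patches(spectrogram, patch_scores, patch_size)
print("masked patch indices:", sorted(int(i) for i in masked_ids))
```

In a real pipeline the attention maps would come from a trained DeiT model rather than random logits, and the masked spectrograms would then be fed to the self-supervised training stage named in the paper title.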

Image Processing / Deep Learning / AR Laboratory
