TY - GEN
T1 - Multi-modal Fusion Using Spatio-temporal and Static Features for Group Emotion Recognition
AU - Sun, Mo
AU - Li, Jian
AU - Feng, Hui
AU - Gou, Wei
AU - Shen, Haifeng
AU - Tang, Jian
AU - Yang, Yi
AU - Ye, Jieping
N1 - Publisher Copyright:
© 2020 ACM.
PY - 2020/10/21
Y1 - 2020/10/21
N2 - This paper presents our approach for the Audio-video Group Emotion Recognition sub-challenge of EmotiW 2020. The task is to classify a video into one of three group emotions: positive, neutral, or negative. Our approach exploits two different feature levels for this task: the spatio-temporal feature level and the static feature level. At the spatio-temporal feature level, we feed multiple input modalities (RGB, RGB difference, optical flow, warped optical flow) into multiple video classification networks to train the spatio-temporal models. At the static feature level, we crop all faces and bodies in an image using a state-of-the-art human pose estimation method and train several kinds of CNNs with the image-level labels of group emotions. Finally, we fuse the results of all 14 models, achieving third place in this sub-challenge with classification accuracies of 71.93% and 70.77% on the validation set and test set, respectively.
AB - This paper presents our approach for the Audio-video Group Emotion Recognition sub-challenge of EmotiW 2020. The task is to classify a video into one of three group emotions: positive, neutral, or negative. Our approach exploits two different feature levels for this task: the spatio-temporal feature level and the static feature level. At the spatio-temporal feature level, we feed multiple input modalities (RGB, RGB difference, optical flow, warped optical flow) into multiple video classification networks to train the spatio-temporal models. At the static feature level, we crop all faces and bodies in an image using a state-of-the-art human pose estimation method and train several kinds of CNNs with the image-level labels of group emotions. Finally, we fuse the results of all 14 models, achieving third place in this sub-challenge with classification accuracies of 71.93% and 70.77% on the validation set and test set, respectively.
KW - audio-video based emotion recognition
KW - group-level emotion recognition
KW - multi-modal
UR - http://www.scopus.com/inward/record.url?scp=85096652998&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85096652998&partnerID=8YFLogxK
U2 - 10.1145/3382507.3417971
DO - 10.1145/3382507.3417971
M3 - Conference contribution
AN - SCOPUS:85096652998
T3 - ICMI 2020 - Proceedings of the 2020 International Conference on Multimodal Interaction
SP - 835
EP - 840
BT - ICMI 2020 - Proceedings of the 2020 International Conference on Multimodal Interaction
PB - Association for Computing Machinery, Inc
T2 - 22nd ACM International Conference on Multimodal Interaction, ICMI 2020
Y2 - 25 October 2020 through 29 October 2020
ER -