TY - GEN
T1 - Enhancing bidirectional association between deep image representations and loosely correlated texts
AU - Chen, Qiuwen
AU - Qiu, Qinru
N1 - Publisher Copyright: © 2016 IEEE.
PY - 2016/10/31
Y1 - 2016/10/31
N2 - The problem of bridging the gap between images and natural language has gained increasing attention in recent years. This paper continues that line of study and improves bidirectional retrieval performance across the two modalities. Unlike previous works that target a single sentence densely describing the objects in an image, we extend the focus to associating deep image representations with noisy texts that are only loosely correlated. Based on text-image fragment embedding, our model employs a sequential configuration that connects two embedding stages. The first stage learns the relevancy of the text fragments, and the second stage uses the filtered output of the first to improve the matching results. The model also integrates multiple convolutional neural networks (CNNs) to construct the image fragments, from which rich context information such as human faces can be extracted to increase the alignment accuracy. The proposed method is evaluated on both a synthetic dataset and a real-world dataset collected from a picture news website. The results show up to a 50% improvement in ranking performance over the comparison models.
UR - http://www.scopus.com/inward/record.url?scp=85007236248&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85007236248&partnerID=8YFLogxK
U2 - 10.1109/IJCNN.2016.7727603
DO - 10.1109/IJCNN.2016.7727603
M3 - Conference contribution
AN - SCOPUS:85007236248
T3 - Proceedings of the International Joint Conference on Neural Networks
SP - 3164
EP - 3171
BT - 2016 International Joint Conference on Neural Networks, IJCNN 2016
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2016 International Joint Conference on Neural Networks, IJCNN 2016
Y2 - 24 July 2016 through 29 July 2016
ER -