Abstract
The problem of bridging the gap between images and natural language has gained increasing attention in recent years. This paper continues that line of study and improves bidirectional retrieval performance across the two modalities. Unlike previous works that target a single sentence densely describing the objects in an image, we extend the focus to associating deep image representations with noisy texts that are only loosely correlated. Based on text-image fragment embedding, our model employs a sequential configuration that connects two embedding stages. The first stage learns the relevancy of the text fragments, and the second stage uses the filtered output of the first to improve the matching results. The model also integrates multiple convolutional neural networks (CNNs) to construct the image fragments, from which rich context information such as human faces can be extracted to increase alignment accuracy. The proposed method is evaluated on both a synthetic dataset and a real-world dataset collected from a picture news website. The results show up to a 50% improvement in ranking performance over the comparison models.
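The two-stage idea in the abstract (filter text fragments by relevancy, then match on what remains) can be illustrated with a minimal sketch. This is not the paper's model: the abstract says the first stage learns relevancy, whereas the sketch below substitutes a fixed inner-product threshold, and the function names (`fragment_relevancy`, `filter_text_fragments`, `match_score`), the threshold value, and the toy vectors are all hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def fragment_relevancy(text_frag: Vector, image_frags: List[Vector]) -> float:
    """Assumed relevancy measure: best inner product against any image fragment."""
    return max(dot(text_frag, img) for img in image_frags)


def filter_text_fragments(
    text_frags: List[Vector], image_frags: List[Vector], threshold: float = 0.0
) -> List[Vector]:
    """Stage 1 (illustrative): keep text fragments whose relevancy exceeds the threshold.

    A fixed threshold stands in for the learned relevancy described in the abstract.
    """
    return [t for t in text_frags if fragment_relevancy(t, image_frags) > threshold]


def match_score(
    text_frags: List[Vector], image_frags: List[Vector], threshold: float = 0.0
) -> float:
    """Stage 2 (illustrative): average best-match similarity over the kept fragments."""
    kept = filter_text_fragments(text_frags, image_frags, threshold)
    if not kept:
        return 0.0
    return sum(fragment_relevancy(t, image_frags) for t in kept) / len(kept)


# Toy data: two image fragments and three text fragments, one of them noise.
image_frags = [[1.0, 0.0], [0.0, 1.0]]
text_frags = [[0.9, 0.1], [-0.5, -0.5], [0.2, 0.8]]

kept = filter_text_fragments(text_frags, image_frags)
score = match_score(text_frags, image_frags)
```

With these toy vectors the second text fragment has negative similarity to every image fragment, so the first stage drops it and the second stage scores the pair using only the remaining two fragments. Ranking candidate images for a text (or texts for an image) would then amount to sorting by `match_score`.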
Original language | English (US) |
---|---|
Title of host publication | 2016 International Joint Conference on Neural Networks, IJCNN 2016 |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 3164-3171 |
Number of pages | 8 |
Volume | 2016-October |
ISBN (Electronic) | 9781509006199 |
State | Published - Oct 31 2016 |
Event | 2016 International Joint Conference on Neural Networks, IJCNN 2016 - Vancouver, Canada; Duration: Jul 24 2016 → Jul 29 2016 |
Other
Other | 2016 International Joint Conference on Neural Networks, IJCNN 2016 |
---|---|
Country | Canada |
City | Vancouver |
Period | 7/24/16 → 7/29/16 |
ASJC Scopus subject areas
- Software
- Artificial Intelligence