Memory Augmented Deep Recurrent Neural Network for Video Question Answering

Chengxiang Yin, Jian Tang, Zhiyuan Xu, Yanzhi Wang

Research output: Contribution to journal › Article › peer-review

3 Scopus citations

Abstract

Video question answering (VideoQA) is a very important but challenging multimedia task, which automatically analyzes questions and videos and generates accurate answers. However, research on VideoQA is still in its infancy. In this article, we propose a novel memory augmented deep recurrent neural network (MA-DRNN) model for VideoQA, which features a new method for encoding videos and questions, and memory augmentation using the emerging differentiable neural computer (DNC). Specifically, we encode textual (questions) information before visual (videos) information, which leads to better visual-textual representations. Moreover, we leverage DNC (with an external memory) for storing and retrieving useful information in questions and videos, and modeling the long-term visual-textual dependence. To evaluate the proposed model, we conducted extensive experiments using the VTW data set and MSVD-QA data set, which are both widely used large-scale video data sets for language-level understanding. The experimental results have well validated the proposed model and showed that it outperforms the state-of-the-art in terms of various accuracy-related metrics.
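
The encoding order described in the abstract (question tokens first, then video frames, with an external memory that is written to and read from along the way) can be illustrated with a small sketch. This is not the authors' MA-DRNN and not a real DNC: the function names, the untrained random weights, the round-robin write rule, and the cosine-similarity read are all assumptions chosen to keep the example short, whereas an actual DNC learns its read and write addressing.

```python
import numpy as np

def cosine_read(memory, key):
    """Content-based read: softmax over cosine similarity of key to each memory row."""
    norms = np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8
    scores = (memory @ key) / norms
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ memory, weights

def encode_question_then_video(question_emb, frame_feats, hidden=16, slots=8, seed=0):
    """Run one tanh RNN over question tokens first, then over video frames.

    question_emb: (num_tokens, dim) array; frame_feats: (num_frames, dim) array.
    All weights are random and untrained (illustration only).
    """
    rng = np.random.default_rng(seed)
    dim = question_emb.shape[1]
    assert frame_feats.shape[1] == dim, "both inputs must share a feature size"
    W_in = rng.normal(scale=0.1, size=(hidden, dim))
    W_h = rng.normal(scale=0.1, size=(hidden, hidden))
    W_r = rng.normal(scale=0.1, size=(hidden, hidden))

    memory = np.zeros((slots, hidden))   # external memory matrix
    h = np.zeros(hidden)                 # controller hidden state
    read = np.zeros(hidden)              # last vector read from memory

    # Textual information is fed before visual information.
    sequence = np.concatenate([question_emb, frame_feats], axis=0)
    for t, x in enumerate(sequence):
        h = np.tanh(W_in @ x + W_h @ h + W_r @ read)
        memory[t % slots] = h            # naive round-robin write (assumption)
        read, _ = cosine_read(memory, h) # content-based read for the next step
    return h, memory
```

Because the question is consumed first, every video frame is processed by a state that already carries question information, and the memory reads let later steps recover states written earlier in the sequence.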

Original language: English (US)
Article number: 8845771
Pages (from-to): 3159-3167
Number of pages: 9
Journal: IEEE Transactions on Neural Networks and Learning Systems
Volume: 31
Issue number: 9
State: Published - Sep 2020

Keywords

  • Deep learning
  • differentiable neural computer (DNC)
  • memory augmented neural network
  • recurrent neural network (RNN)
  • video question answering (VideoQA)

ASJC Scopus subject areas

  • Software
  • Computer Science Applications
  • Computer Networks and Communications
  • Artificial Intelligence
