A method and system for detecting and localizing a target audio event in an audio clip is disclosed. The method and system utilizes a hierarchical approach in which a dilated convolutional neural network detects the presence of the target audio event anywhere in the audio clip based on high-level audio features. If the target audio event is detected somewhere in the audio clip, the method and system further utilizes a robust audio vector representation that encodes the inherent state of the audio as well as a learned relationship between the state of the audio and the particular target audio event that was detected in the audio clip. A bi-directional long short-term memory classifier is used to model long-term dependencies and determine the boundaries in time of the target audio event within the audio clip based on the audio vector representations.
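The two-stage control flow described above can be illustrated with a minimal Python sketch. It is not the disclosed implementation: `clip_detector` and `frame_tagger` are hypothetical stand-ins for the trained dilated convolutional network and the bi-directional LSTM classifier, the 0.5 threshold and the one-second frame length are assumptions, and `dilated_conv1d` only shows what dilation means for a single one-dimensional filter.

```python
from typing import Callable, List, Sequence, Tuple


def dilated_conv1d(signal: Sequence[float], kernel: Sequence[float],
                   dilation: int) -> List[float]:
    """Valid-mode 1-D dilated convolution (cross-correlation form).

    Kernel taps are spaced `dilation` samples apart, so a short kernel
    covers a wide stretch of the input without extra parameters.
    """
    span = (len(kernel) - 1) * dilation
    return [
        sum(k * signal[t + i * dilation] for i, k in enumerate(kernel))
        for t in range(len(signal) - span)
    ]


def frames_to_segments(frame_scores: Sequence[float], threshold: float = 0.5,
                       frame_seconds: float = 1.0) -> List[Tuple[float, float]]:
    """Convert per-frame scores into (start, end) times of contiguous runs
    whose score is at or above the threshold."""
    segments: List[Tuple[float, float]] = []
    start = None
    for i, score in enumerate(frame_scores):
        if score >= threshold and start is None:
            start = i  # a run begins
        elif score < threshold and start is not None:
            segments.append((start * frame_seconds, i * frame_seconds))
            start = None  # the run ends
    if start is not None:  # a run reaches the end of the clip
        segments.append((start * frame_seconds, len(frame_scores) * frame_seconds))
    return segments


def detect_then_localize(
    features: Sequence[float],
    clip_detector: Callable[[Sequence[float]], float],          # hypothetical stage 1
    frame_tagger: Callable[[Sequence[float]], Sequence[float]],  # hypothetical stage 2
    threshold: float = 0.5,
) -> List[Tuple[float, float]]:
    """Run the frame-level stage only if the clip-level stage fires."""
    if clip_detector(features) < threshold:
        return []  # event not detected anywhere in the clip
    return frames_to_segments(frame_tagger(features), threshold)
```

For example, calling `detect_then_localize(scores, max, list)` with toy per-frame scores treats the maximum score as the clip-level decision and the scores themselves as the frame-level output; a real system would substitute trained models for both callables.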
|Original language||English (US)|
|State||Published - Sep 6 2019|