TY - GEN
T1 - Towards ultra-high performance and energy efficiency of deep learning systems: An algorithm-hardware co-optimization framework
T2 - 32nd AAAI Conference on Artificial Intelligence, AAAI 2018
AU - Wang, Yanzhi
AU - Ding, Caiwen
AU - Li, Zhe
AU - Yuan, Geng
AU - Liao, Siyu
AU - Ma, Xiaolong
AU - Yuan, Bo
AU - Qian, Xuehai
AU - Tang, Jian
AU - Qiu, Qinru
AU - Lin, Xue
N1 - Publisher Copyright:
Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
PY - 2018
Y1 - 2018
N2 - Hardware acceleration of deep learning systems has been extensively investigated in industry and academia. The aim of this paper is to achieve ultra-high energy efficiency and performance for hardware implementations of deep neural networks (DNNs). An algorithm-hardware co-optimization framework is developed that is applicable to different DNN types, sizes, and application scenarios. The algorithm part adopts general block-circulant matrices to achieve a fine-grained tradeoff between accuracy and compression ratio. It applies to both fully-connected and convolutional layers and is backed by a mathematically rigorous proof of the method's effectiveness. The proposed algorithm reduces computational complexity per layer from O(n^2) to O(n log n) and storage complexity from O(n^2) to O(n), for both training and inference. The hardware part consists of highly efficient Field Programmable Gate Array (FPGA)-based implementations using effective reconfiguration, batch processing, deep pipelining, resource reuse, and hierarchical control. Experimental results demonstrate that the proposed framework achieves at least a 152X speedup and a 71X energy efficiency gain compared with the IBM TrueNorth processor at the same test accuracy, and at least a 31X energy efficiency gain compared with the reference FPGA-based work.
AB - Hardware acceleration of deep learning systems has been extensively investigated in industry and academia. The aim of this paper is to achieve ultra-high energy efficiency and performance for hardware implementations of deep neural networks (DNNs). An algorithm-hardware co-optimization framework is developed that is applicable to different DNN types, sizes, and application scenarios. The algorithm part adopts general block-circulant matrices to achieve a fine-grained tradeoff between accuracy and compression ratio. It applies to both fully-connected and convolutional layers and is backed by a mathematically rigorous proof of the method's effectiveness. The proposed algorithm reduces computational complexity per layer from O(n^2) to O(n log n) and storage complexity from O(n^2) to O(n), for both training and inference. The hardware part consists of highly efficient Field Programmable Gate Array (FPGA)-based implementations using effective reconfiguration, batch processing, deep pipelining, resource reuse, and hierarchical control. Experimental results demonstrate that the proposed framework achieves at least a 152X speedup and a 71X energy efficiency gain compared with the IBM TrueNorth processor at the same test accuracy, and at least a 31X energy efficiency gain compared with the reference FPGA-based work.
UR - http://www.scopus.com/inward/record.url?scp=85052762317&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85052762317&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85052762317
T3 - 32nd AAAI Conference on Artificial Intelligence, AAAI 2018
SP - 4235
EP - 4243
BT - 32nd AAAI Conference on Artificial Intelligence, AAAI 2018
PB - AAAI Press
Y2 - 2 February 2018 through 7 February 2018
ER -
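
The abstract's O(n log n) per-layer complexity follows from the fact that a circulant block's matrix-vector product is a circular convolution, computable with FFTs. Below is a minimal NumPy sketch of such a block-circulant product; the function name, array shapes, and the first-column convention for defining each circulant block are illustrative assumptions, not the authors' implementation.

import numpy as np

def block_circulant_matvec(C, x, b):
    """Blockwise FFT-based product W @ x, where W is a (p*b) x (q*b)
    weight matrix composed of b x b circulant blocks and C[i, j] holds
    the defining (first-column) vector of block (i, j)."""
    p, q, _ = C.shape
    X = np.fft.fft(x.reshape(q, b), axis=1)   # FFT of each length-b input block
    Cf = np.fft.fft(C, axis=2)                # FFT of every block's defining vector
    Y = (Cf * X[None, :, :]).sum(axis=1)      # accumulate circular convolutions over j
    return np.real(np.fft.ifft(Y, axis=1)).reshape(p * b)

if __name__ == "__main__":
    # Cross-check against an explicit dense block-circulant matrix,
    # which makes the O(n^2) vs O(n log n) contrast concrete.
    rng = np.random.default_rng(0)
    p, q, b = 2, 3, 4
    C = rng.standard_normal((p, q, b))
    x = rng.standard_normal(q * b)
    W = np.block([[np.array([np.roll(C[i, j], s) for s in range(b)]).T
                   for j in range(q)] for i in range(p)])
    assert np.allclose(W @ x, block_circulant_matvec(C, x, b))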