TY - GEN
T1 - 7.5 A 65nm 0.39-to-140.3TOPS/W 1-to-12b Unified Neural Network Processor Using Block-Circulant-Enabled Transpose-Domain Acceleration with 8.1× Higher TOPS/mm² and 6T HBST-TRAM-Based 2D Data-Reuse Architecture
AU - Yue, Jinshan
AU - Liu, Ruoyang
AU - Sun, Wenyu
AU - Yuan, Zhe
AU - Wang, Zhibo
AU - Tu, Yung-Ning
AU - Chen, Yi-Ju
AU - Ren, Ao
AU - Wang, Yanzhi
AU - Chang, Meng-Fan
AU - Li, Xueqing
AU - Yang, Huazhong
AU - Liu, Yongpan
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/3/6
Y1 - 2019/3/6
N2 - Energy-efficient neural-network (NN) processors have been proposed for battery-powered deep-learning applications, where convolutional (CNN), fully-connected (FC) and recurrent NNs (RNN) are three major workloads. To support all of them, previous solutions [1-3] use either area-inefficient heterogeneous architectures, including CNN and RNN cores, or an energy-inefficient reconfigurable architecture. A block-circulant algorithm [4] can unify CNN/FC/RNN workloads with transpose-domain acceleration, as shown in Fig. 7.5.1. Once NN weights are trained using the block-circulant pattern, all workloads are transformed into consistent matrix-vector multiplications (MVM), which can potentially achieve 8-to-128× storage savings and an O(n²)-to-O(n·log(n)) computation complexity reduction.
AB - Energy-efficient neural-network (NN) processors have been proposed for battery-powered deep-learning applications, where convolutional (CNN), fully-connected (FC) and recurrent NNs (RNN) are three major workloads. To support all of them, previous solutions [1-3] use either area-inefficient heterogeneous architectures, including CNN and RNN cores, or an energy-inefficient reconfigurable architecture. A block-circulant algorithm [4] can unify CNN/FC/RNN workloads with transpose-domain acceleration, as shown in Fig. 7.5.1. Once NN weights are trained using the block-circulant pattern, all workloads are transformed into consistent matrix-vector multiplications (MVM), which can potentially achieve 8-to-128× storage savings and an O(n²)-to-O(n·log(n)) computation complexity reduction.
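N1 - A minimal NumPy sketch (not from the paper; the function name, shapes, and test values are illustrative assumptions) of the block-circulant MVM the abstract describes: each b-by-b weight block is circulant, so the block multiply reduces to element-wise products of FFTs, giving the cited O(n²)-to-O(n·log(n)) per-block complexity reduction.

import numpy as np

def block_circulant_mvm(blocks, x, b):
    # blocks: shape (p, q, b); blocks[i, j] is the first column that
    # defines the circulant b-by-b block at row i, column j.
    # x: input vector of length q*b. Returns y of length p*b.
    # Cost per block is O(b log b) via FFT instead of O(b^2) dense.
    p, q, _ = blocks.shape
    X = np.fft.fft(x.reshape(q, b), axis=1)       # FFT of each input sub-vector
    B = np.fft.fft(blocks, axis=2)                # FFT of each block's defining vector
    Y = (B * X[np.newaxis, :, :]).sum(axis=1)     # multiply per block, accumulate over columns
    return np.fft.ifft(Y, axis=1).real.reshape(p * b)

# Quick check against an explicitly materialized block-circulant matrix:
p, q, b = 2, 3, 4
rng = np.random.default_rng(0)
blocks = rng.standard_normal((p, q, b))
x = rng.standard_normal(q * b)
dense = np.block([[blocks[i, j][(np.arange(b)[:, None] - np.arange(b)) % b]
                   for j in range(q)] for i in range(p)])
assert np.allclose(dense @ x, block_circulant_mvm(blocks, x, b))

Because the same FFT-domain kernel serves CNN, FC, and RNN layers once their weights are trained in this pattern, a single datapath can cover all three workloads, which is the unification argument of the abstract.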
UR - http://www.scopus.com/inward/record.url?scp=85063536330&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85063536330&partnerID=8YFLogxK
U2 - 10.1109/ISSCC.2019.8662360
DO - 10.1109/ISSCC.2019.8662360
M3 - Conference contribution
AN - SCOPUS:85063536330
T3 - Digest of Technical Papers - IEEE International Solid-State Circuits Conference
SP - 138
EP - 140
BT - 2019 IEEE International Solid-State Circuits Conference, ISSCC 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2019 IEEE International Solid-State Circuits Conference, ISSCC 2019
Y2 - 17 February 2019 through 21 February 2019
ER -