TY - GEN
T1 - Accelerating Block-Circulant Matrix-Based Neural Network Layer on a General Purpose Computing Platform
T2 - Future of Information and Communication Conference, FICC 2020
AU - Pugdeethosapol, Krittaphat
AU - Jin, Zhao
AU - Rider, Daniel
AU - Qiu, Qinru
N1 - Publisher Copyright:
© 2020, Springer Nature Switzerland AG.
PY - 2020
Y1 - 2020
N2 - Deep neural networks (DNNs) have become a powerful tool and have enabled state-of-the-art accuracy on many challenging tasks. However, large-scale DNNs consume substantial computation time and storage space. To improve network performance while maintaining accuracy, the block-circulant matrix-based (BCM) algorithm has been introduced. BCM uses the Fast Fourier Transform (FFT) with block-circulant matrices to compute the output of each layer of the network. Unlike conventional pruning techniques, the BCM preserves the network structure. Compared to a conventional matrix implementation, the BCM reduces the computational complexity of a neural network layer from O(n^2) to O(n^2/k), and it has been proven to be highly effective when implemented on customized hardware such as FPGAs. However, on general-purpose computing platforms its performance suffers from the overhead of FFT and matrix reshaping. In certain cases, using the BCM does not improve the total computation time of the networks at all. In this paper, we propose a parallel implementation of the BCM layer and provide guidelines that generally lead to better implementation practice. The guidelines cover popular implementation languages and packages, including Python, numpy, intel-numpy, tensorflow, and nGraph.
AB - Deep neural networks (DNNs) have become a powerful tool and have enabled state-of-the-art accuracy on many challenging tasks. However, large-scale DNNs consume substantial computation time and storage space. To improve network performance while maintaining accuracy, the block-circulant matrix-based (BCM) algorithm has been introduced. BCM uses the Fast Fourier Transform (FFT) with block-circulant matrices to compute the output of each layer of the network. Unlike conventional pruning techniques, the BCM preserves the network structure. Compared to a conventional matrix implementation, the BCM reduces the computational complexity of a neural network layer from O(n^2) to O(n^2/k), and it has been proven to be highly effective when implemented on customized hardware such as FPGAs. However, on general-purpose computing platforms its performance suffers from the overhead of FFT and matrix reshaping. In certain cases, using the BCM does not improve the total computation time of the networks at all. In this paper, we propose a parallel implementation of the BCM layer and provide guidelines that generally lead to better implementation practice. The guidelines cover popular implementation languages and packages, including Python, numpy, intel-numpy, tensorflow, and nGraph.
KW - Acceleration
KW - Block-circulant matrix
KW - Deep learning
KW - Parallel computing
UR - http://www.scopus.com/inward/record.url?scp=85081402280&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85081402280&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-39442-4_32
DO - 10.1007/978-3-030-39442-4_32
M3 - Conference contribution
AN - SCOPUS:85081402280
SN - 9783030394417
T3 - Advances in Intelligent Systems and Computing
SP - 419
EP - 435
BT - Advances in Information and Communication - Proceedings of the 2020 Future of Information and Communication Conference FICC
A2 - Arai, Kohei
A2 - Kapoor, Supriya
A2 - Bhatia, Rahul
PB - Springer
Y2 - 5 March 2020 through 6 March 2020
ER -