C-LSTM: Enabling efficient LSTM using structured compression techniques on FPGAs

Shuo Wang, Zhe Li, Caiwen Ding, Bo Yuan, Qinru Qiu, Yanzhi Wang, Yun Liang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

31 Citations (Scopus)

Abstract

Recently, significant accuracy improvements have been achieved in acoustic recognition systems by increasing the model size of Long Short-Term Memory (LSTM) networks. Unfortunately, the ever-increasing size of LSTM models leads to inefficient designs on FPGAs due to the limited on-chip resources. Previous work proposes a pruning-based compression technique to reduce the model size and thus speed up inference on FPGAs. However, the random nature of pruning transforms the dense matrices of the model into highly unstructured sparse ones, which leads to unbalanced computation and irregular memory accesses and thus hurts the overall performance and energy efficiency. In contrast, we propose a structured compression technique which not only reduces the LSTM model size but also eliminates the irregularities of computation and memory accesses. This approach employs block-circulant rather than sparse matrices to compress the weight matrices, reducing the storage requirement from O(k²) to O(k). The Fast Fourier Transform algorithm is utilized to further accelerate inference by reducing the computational complexity from O(k²) to O(k log k). The datapath and activation functions are quantized to 16 bits to improve resource utilization. More importantly, we propose a comprehensive framework called C-LSTM to automatically optimize and implement a wide range of LSTM variants on FPGAs. According to the experimental results, C-LSTM achieves up to 18.8X performance and 33.5X energy efficiency gains compared with the state-of-the-art LSTM implementation under the same experimental setup, with very small accuracy degradation.
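The complexity reduction the abstract describes can be sketched in software. This is a minimal NumPy illustration of the idea, not the paper's FPGA implementation: a circulant matrix-vector product is a circular convolution, so it can be computed with FFTs in O(k log k), and a block-circulant weight matrix needs only the first column of each k×k block, giving O(k) storage per block. All function names here are illustrative, not from the paper.

```python
import numpy as np

def circulant_matvec(c, x):
    # y = C @ x, where C is the k x k circulant matrix whose first
    # column is c (i.e. C[i, j] = c[(i - j) % k]). A circulant matvec
    # is a circular convolution, so the FFT evaluates it in O(k log k)
    # instead of the O(k^2) of a dense matvec.
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

def block_circulant_matvec(first_cols, x, k):
    # Matvec for a weight matrix partitioned into k x k circulant
    # blocks. first_cols[p][q] holds the first column of block (p, q),
    # so each block stores k values instead of k * k.
    rows, cols = len(first_cols), len(first_cols[0])
    x_parts = x.reshape(cols, k)
    y = np.zeros(rows * k)
    for p in range(rows):
        for q in range(cols):
            y[p * k:(p + 1) * k] += circulant_matvec(first_cols[p][q],
                                                     x_parts[q])
    return y

# Sanity check against an explicitly constructed dense circulant matrix.
k = 8
rng = np.random.default_rng(0)
c, x = rng.standard_normal(k), rng.standard_normal(k)
C = np.array([[c[(i - j) % k] for j in range(k)] for i in range(k)])
assert np.allclose(C @ x, circulant_matvec(c, x))
```

In the paper's setting, the per-block FFTs map naturally onto FPGA FFT pipelines, which is what eliminates the irregular memory accesses that pruning-based sparsity introduces.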

Original language: English (US)
Title of host publication: FPGA 2018 - Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
Publisher: Association for Computing Machinery, Inc
Pages: 11-20
Number of pages: 10
Volume: 2018-February
ISBN (Electronic): 9781450356145
DOI: 10.1145/3174243.3174253
State: Published - Feb 15 2018
Event: 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA 2018 - Monterey, United States
Duration: Feb 25 2018 - Feb 27 2018

Other

Other: 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA 2018
Country: United States
City: Monterey
Period: 2/25/18 - 2/27/18

Keywords

  • Block-circulant matrix
  • Compression
  • FFT
  • FPGA
  • LSTM
  • RNNs

ASJC Scopus subject areas

  • Hardware and Architecture
  • Electrical and Electronic Engineering

Cite this

Wang, S., Li, Z., Ding, C., Yuan, B., Qiu, Q., Wang, Y., & Liang, Y. (2018). C-LSTM: Enabling efficient LSTM using structured compression techniques on FPGAs. In FPGA 2018 - Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (Vol. 2018-February, pp. 11-20). Association for Computing Machinery, Inc. https://doi.org/10.1145/3174243.3174253
