TY - GEN
T1 - Compact Multi-level Sparse Neural Networks with Input Independent Dynamic Rerouting
AU - Qin, Minghai
AU - Zhang, Tianyun
AU - Sun, Fei
AU - Chen, Yen-Kuang
AU - Fardad, Makan
AU - Wang, Yanzhi
AU - Xie, Yuan
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
AB - Deep neural networks (DNNs) have been shown to provide superb performance in many real-life applications, but their large computation costs and storage requirements have prevented them from being deployed to many edge and internet-of-things (IoT) devices. Sparse deep neural networks, in which the majority of the weight parameters are zero, can substantially reduce the computation complexity and memory consumption of the models. In real-use scenarios, devices may suffer from large fluctuations in the available computation and memory resources under different environments, and the quality of service (QoS) is difficult to maintain due to long-tail inferences with large latency. Facing these real-life challenges, we propose to train a sparse model that supports multiple sparsity levels. That is, the weights satisfy a hierarchical structure such that the locations and the values of the non-zero parameters of the more-sparse sub-model are a subset of those of the less-sparse sub-model. In this way, one can dynamically select the appropriate sparsity level during inference, while the storage cost is capped by the least sparse sub-model. We have verified our methodologies on a variety of DNN models and tasks, including ResNet-50, PointNet++, GNMT, and graph attention networks. We obtain sparse sub-models with an average of 13.38% of the weights and 14.97% of the FLOPs, while their accuracies are as good as those of their dense counterparts. More-sparse sub-models with 5.38% of the weights and 4.47% of the FLOPs, which are subsets of the less-sparse ones, can be obtained with only 3.25% relative accuracy loss. In addition, our proposed hierarchical model structure supports a mechanism to run inference on the first part of the model at the less-sparse level and dynamically reroute to the more-sparse level if the real-time latency constraint is estimated to be violated. Preliminary analysis shows that we can improve the QoS by one or two nines, depending on the task and the computation-memory resources of the inference engine.
KW - artificial intelligence
KW - deep neural networks
KW - weight pruning
UR - http://www.scopus.com/inward/record.url?scp=85156108067&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85156108067&partnerID=8YFLogxK
U2 - 10.1109/ICTAI56018.2022.00088
DO - 10.1109/ICTAI56018.2022.00088
M3 - Conference contribution
AN - SCOPUS:85156108067
T3 - Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI
SP - 555
EP - 562
BT - Proceedings - 2022 IEEE 34th International Conference on Tools with Artificial Intelligence, ICTAI 2022
A2 - Reformat, Marek
A2 - Zhang, Du
A2 - Bourbakis, Nikolaos G.
PB - IEEE Computer Society
T2 - 34th IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2022
Y2 - 31 October 2022 through 2 November 2022
ER -