We introduce an end-to-end framework for jointly reducing model size on disk and accelerating inference. Prior pruning-based approaches (▲) optimize for compute in terms of FLOPs (y-axis) but achieve only small reductions in model size on disk (x-axis). Codebook-quantization-based approaches (■), on the other hand, optimize for model size and achieve very little compute reduction. We jointly optimize for both quantities in an end-to-end manner, achieving higher levels of reduction in both.
We introduce LilNetX, an end-to-end trainable technique for neural networks that enables learning models with a specified accuracy-compression-computation trade-off. Prior works approach these problems one at a time and often require post-processing or multi-stage training. Our method, in contrast, constructs a joint training objective that penalizes the self-information of network parameters in a latent representation space to encourage small model size, while also introducing priors that increase structured sparsity in the parameter space to reduce computation. Compared with existing state-of-the-art model compression methods, we achieve up to 50% smaller model size and 98% model sparsity on ResNet-20 on the CIFAR-10 dataset, as well as 31% smaller model size and 81% structured sparsity on ResNet-50 trained on ImageNet, while retaining the same accuracy as these methods. The resulting sparsity can improve inference time by a factor of almost 1.86x over a dense ResNet-50 model.
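
To give a rough sense of what such a joint objective can look like, the sketch below combines a task loss with a rate term (the self-information of quantized latent parameters under a learned prior) and a group-sparsity prior. The function `joint_objective`, the prior model `latent_log_prob`, and the weighting factors are hypothetical placeholders for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def joint_objective(logits, targets, latents, latent_log_prob,
                    lambda_rate=1e-4, lambda_group=1e-4):
    """Hypothetical sketch of a joint accuracy/size/compute objective."""
    # Task loss on the downstream prediction (classification here).
    task = F.cross_entropy(logits, targets)

    # Rate term: negative log-likelihood (self-information) of the quantized
    # latent parameters under a learned prior. Lower values mean the latents
    # are cheaper to entropy-code, i.e. a smaller model on disk.
    rate = sum(-latent_log_prob(z).sum() for z in latents)

    # Structured-sparsity prior: a group-lasso style penalty over the first
    # dimension of each latent tensor, pushing entire slices of the decoded
    # weights toward exactly zero to reduce compute.
    group = sum(z.reshape(z.shape[0], -1).norm(p=2, dim=1).sum()
                for z in latents)

    return task + lambda_rate * rate + lambda_group * group
```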
|
Our approach maintains quantized latent representations for efficient storage, which can be decoded into sparse model weights for faster inference. We utilize a special type of slice sparsity to obtain inference speedups without hardware modifications, as illustrated in the sketch below. A joint loss function is optimized simultaneously for downstream performance, model size, and compute. Our approach is generalizable, requires no post-hoc training stages, and can be applied to any task.
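
The following minimal, hypothetical sketch (not the released implementation) illustrates the slice-sparsity idea: quantized latents are decoded into convolution weights, and output channels ("slices") that decode to all zeros are dropped, leaving a smaller dense layer that runs faster on standard hardware.

```python
import torch
import torch.nn as nn

def decode_and_slice_prune(latent, conv):
    """Illustrative decode of quantized latents into a smaller dense conv."""
    # Quantize latents for storage, then decode back to float weights.
    # A simple rounding step stands in for the learned decoder here.
    w = torch.round(latent).float()

    # A "slice" is one output channel of the kernel; keep only channels with
    # at least one nonzero entry, so the layer stays dense but smaller.
    keep = w.abs().sum(dim=(1, 2, 3)) > 0
    pruned = nn.Conv2d(conv.in_channels, int(keep.sum()),
                       conv.kernel_size, stride=conv.stride,
                       padding=conv.padding, bias=False)
    with torch.no_grad():
        pruned.weight.copy_(w[keep])
    return pruned, keep  # `keep` tells the next layer which inputs to drop
```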
Sharath Girish, Kamal Gupta, Saurabh Singh, Abhinav Shrivastava. LilNetX: Lightweight Networks with EXtreme Model Compression and Structured Sparsification. In International Conference on Learning Representations (ICLR), 2023. (hosted on OpenReview)
Acknowledgements |