RESEARCH


Model Compression

In deep learning, model compression is a key technique for real-world applications and model deployment. Its goals are to minimize model size and reduce inference latency, and they can be achieved with methods such as pruning, knowledge distillation, factorization, and quantization. Among these, our research focuses on pruning and quantization. Pruning removes redundant weight parameters to make the network smaller and lower its computation cost; it can be applied element-wise or in structured manners, e.g., row-, column-, or channel-wise. Quantization reduces the number of bits required to represent the model's weights and activations, which shrinks the model size and, when paired with a dedicated framework and hardware, can also decrease inference latency.
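
The short sketch below illustrates the two techniques discussed above using PyTorch's built-in utilities: element-wise magnitude pruning followed by post-training dynamic quantization. The network architecture, layer sizes, and 50% sparsity level are illustrative assumptions, not values from our work.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A small stand-in network (hypothetical sizes, chosen only for illustration).
model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)

for module in model.modules():
    if isinstance(module, nn.Linear):
        # Element-wise (unstructured) pruning: zero the 50% of weights with
        # the smallest L1 magnitude in each Linear layer.
        prune.l1_unstructured(module, name="weight", amount=0.5)
        # Make the pruning permanent by folding the mask into the weight tensor.
        prune.remove(module, "weight")

# Post-training dynamic quantization: Linear weights are stored as int8 and
# activations are quantized on the fly at inference time.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(quantized(x).shape)  # torch.Size([1, 10])
```

Structured variants (e.g., `prune.ln_structured` over rows or channels) follow the same pattern but remove whole slices of the weight tensor, which maps more directly onto hardware speedups than element-wise sparsity.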