eScholarship
Open Access Publications from the University of California

UC Davis Electronic Theses and Dissertations

A Memory-Efficient YOLO Object Detection Convolutional Neural Network Inference Engine On The KiloCore 2 Manycore Platform

Abstract

Object Detection is one of the most resource-intensive tasks for Convolutional Neural Networks (CNNs). To predict object categories while simultaneously determining their locations, an Object Detection network requires a very deep structure, typically 10 to 50 layers, along with a huge number of learnable parameters ranging from a few million to over a billion. Many architectures have been implemented on various hardware platforms to accelerate Object Detection inference, such as the cuDNN library on Nvidia GPUs and the Intel DLIA framework. However, these platforms either consume substantial power or require large memories to perform their acceleration algorithms, making them unsuitable for edge-computing use cases where power is limited and energy must be conserved.

This thesis presents an efficient and high-throughput network inference implementation of the YOLOv3-Tiny Object Detection system on the KiloCore 2 manycore platform. YOLOv3-Tiny is a lightweight yet accurate Object Detection network with only 13 Convolution layers and 8,861,918 learnable parameters, and KiloCore 2 is a low-power manycore processor chip with 697 programmable cores and a high-speed on-chip communication network.

Specifically, this thesis presents two software optimization techniques that relax the memory requirements of YOLOv3-Tiny, along with a scalable hardware architecture for calculating convolutions. On the software side, low-precision quantization reduces all parameters from 16 bits to 8 bits while still maintaining 90% of the original accuracy, and Batch Normalization (BN) Folding reduces the computational complexity of the network, removing all BN layers together with 12,736 parameters. This thesis describes a standardized process for applying quantization and BN Folding so that these optimization techniques can be implemented on any CNN. On the hardware side, a scalable and modular convolution architecture utilizing a maximum of 536 cores on KiloCore 2 is presented, achieving a high throughput per chip area of 1.002 frames per second/cm^2 and a low energy consumption of 2.232 J/image.

Compared with other hardware platforms such as general-purpose CPUs and specialized GPU accelerators, this implementation achieves less than a 5% reduction in throughput per chip area but offers 9.17x to 441x greater throughput per watt. Furthermore, to run the full YOLOv3-Tiny network, this implementation requires only 17.72 MB of off-chip memory for parameters and 896 KB of on-chip memory, a 49.5x to 64.8x memory reduction compared with the GPU implementations.
