eScholarship
Open Access Publications from the University of California

UC Riverside Electronic Theses and Dissertations

Towards Ultra-Efficient Machine Learning for Edge Inference

Creative Commons 'BY-NC-ND' version 4.0 license
Abstract

Deep neural networks (DNNs) have been increasingly deployed on and integrated with edge devices, such as mobile phones, drones, robots, and wearables. To run DNN inference directly on edge devices (a.k.a. edge inference) with satisfactory performance, optimizing the DNN design (e.g., neural architecture and quantization policy) is crucial. However, designing an optimal DNN for even a single edge device often requires repeated design iterations and is non-trivial. Worse yet, DNN model developers commonly need to serve extremely diverse edge devices. Therefore, it has become crucially important to scale up the optimization of DNNs for edge inference using automated approaches. In this dissertation, we propose several solutions to scalably and efficiently optimize the DNN design for diverse edge devices, with increasingly flexible design considerations.

First, given that a large number of diverse DNN models can be generated by navigating the design space of neural architectures and compression techniques, we study how to select the best DNN model out of many choices for each individual edge device. We propose a novel automated and user-centric DNN selection engine, called Aquaman, which leverages users’ Quality of Experience (QoE) feedback to guide DNN selection decisions. The core of Aquaman is a machine learning-based QoE predictor that is continuously updated online, combined with neural bandit learning to balance exploitation and exploration. However, the assumption of a pre-existing DNN model pool in Aquaman is inherently limiting and may not serve every edge device’s best interest.

Therefore, we next exploit the design freedom of neural architectures by resorting to hardware-aware neural architecture search (NAS) to optimize the DNN design for a given target device. NAS can thoroughly explore the model architecture search space and automatically discover the optimal combination of building blocks, namely a model, for any target device. A key requirement of efficient hardware-aware NAS is the fast evaluation of inference latencies in order to rank different architectures. While building a latency predictor for each target device is common in the state of the art, doing so is very time-consuming and lacks scalability in the presence of extremely diverse devices. We address this scalability challenge by exploiting latency monotonicity: the latency rankings of architectures on different devices are often correlated. When strong latency monotonicity exists, we can reuse architectures searched for one proxy device on new target devices without losing optimality. In the absence of strong latency monotonicity, we propose an efficient proxy adaptation technique that significantly boosts latency monotonicity. Our results highlight that, by using just one proxy device, we can find almost the same Pareto-optimal architectures as existing per-device NAS while avoiding the prohibitive cost of building a latency predictor for each device, reducing the cost of hardware-aware NAS from O(N) to O(1).
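To make the notion of latency monotonicity concrete, the following minimal sketch quantifies it between two devices using Spearman’s rank correlation coefficient (SRCC) over measured latencies of a shared set of candidate architectures. The device names, latency values, and the 0.9 threshold are illustrative assumptions, not measurements or settings from the dissertation.

```python
# Minimal sketch: quantifying latency monotonicity between a proxy device and a
# target device via Spearman's rank correlation coefficient (SRCC).
# The architecture set and latency numbers below are purely illustrative.
from scipy.stats import spearmanr

# Measured inference latencies (ms) of the same candidate architectures
# on two different edge devices (hypothetical values).
latency_proxy  = [12.1, 15.4, 9.8, 20.3, 17.6, 11.0]   # e.g., a proxy phone
latency_target = [18.5, 23.0, 15.2, 31.7, 26.4, 16.9]  # e.g., a new target device

srcc, _ = spearmanr(latency_proxy, latency_target)
print(f"Latency SRCC between devices: {srcc:.3f}")

# A high SRCC (close to 1) indicates strong latency monotonicity: architectures
# rank (almost) the same on both devices, so architectures searched with the
# proxy device can be reused on the target device.
STRONG_MONOTONICITY = 0.9  # illustrative threshold, not from the dissertation
if srcc >= STRONG_MONOTONICITY:
    print("Reuse the proxy's searched architectures on the target device.")
else:
    print("Consider proxy adaptation before reusing the search results.")
```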
Further, beyond the design flexibility of neural architectures brought by NAS (i.e., software design), exploring the hardware design space, such as optimizing hardware accelerators built on FPGAs or ASICs and the corresponding dataflows (e.g., scheduling DNN computations and mapping them onto hardware), is also critical for speeding up DNN execution. While hardware-software co-design can further optimize DNN performance, it also enlarges the search space exponentially, to a practically intractable size, presenting significant challenges. By settling in between the fully decoupled approach and the fully coupled hardware-software co-design approach, we propose a new semi-decoupled approach that reduces the total co-search space by orders of magnitude without losing design optimality. Our approach again builds on latency and energy monotonicity: neural architectures’ ranking orders in terms of inference latency and energy consumption on different accelerators are highly correlated. Our results confirm that strong latency and energy monotonicity exist among different accelerator designs. More importantly, by using one candidate accelerator as the proxy and obtaining its small set of optimal architectures, we can reuse the same architecture set for other accelerator candidates during the hardware search stage.
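The reuse step described above can be summarized with a short sketch of the semi-decoupled co-search loop. The function names and arguments here (search_pareto_architectures, evaluate_on_accelerator) are hypothetical placeholders assumed for illustration; the sketch only conveys that NAS runs once on a proxy accelerator and its small Pareto-optimal architecture set is reused while searching over accelerator candidates.

```python
# Minimal sketch of the semi-decoupled co-search idea, assuming latency/energy
# monotonicity holds across accelerator candidates. The helper functions are
# hypothetical placeholders, not an actual API from the dissertation.

def semi_decoupled_co_search(accelerator_candidates, architecture_space,
                             search_pareto_architectures, evaluate_on_accelerator):
    # Step 1 (software stage): run NAS once on a single proxy accelerator to
    # obtain a small set of Pareto-optimal architectures.
    proxy = accelerator_candidates[0]
    pareto_archs = search_pareto_architectures(architecture_space, proxy)

    # Step 2 (hardware stage): reuse the same architecture set for every
    # accelerator candidate instead of re-running NAS per accelerator.
    best = None
    for acc in accelerator_candidates:
        for arch in pareto_archs:
            # Assumed scalar score where higher is better, e.g., an
            # accuracy/latency/energy trade-off.
            score = evaluate_on_accelerator(arch, acc)
            if best is None or score > best[0]:
                best = (score, arch, acc)

    # The hardware stage evaluates only |candidates| x |pareto_archs|
    # combinations rather than the full joint design space.
    return best[1], best[2]
```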
