UCLA Electronic Theses and Dissertations
Communication Optimization for Customizable Domain-Specific Computing

Abstract

This dissertation investigates communication optimization for customizable domain-specific computing at different levels of a customizable heterogeneous platform (CHP), with the goal of improving system performance and energy efficiency.

Fabric-level optimization driven by emerging devices. Programmable fabrics (e.g., FPGAs) can improve the energy efficiency of domain-specific computing by more than 10x over CPUs, since they can be customized to the application kernels in the target domain. However, the programmable interconnects inside FPGAs account for more than 50% of FPGA area, delay, and power. We propose a novel programmable interconnect architecture based on resistive RAM (RRAM), an emerging device with high density and low power; we optimize the layout and programming circuit of the new architecture, and we extend the benefits of RRAM to routing buffers. Because emerging RRAM manufacturing processes exhibit high defect rates, we further develop a defect-aware communication mechanism. Conventional defect avoidance would leave a large portion of the chip unusable in the new architecture, so we propose defect utilization methodologies that treat stuck-closed defects as shorting constraints in signal routing. We develop a scalable algorithm that performs timing-driven routing under these extra constraints and successfully suppresses the impact of defects.
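To make the defect-utilization idea concrete, the sketch below models a routing fabric as a graph of wires and programmable switches, where each stuck-closed defect permanently ties two wires together: a net may use a wire only if every wire shorted to it is free or carries the same net, and the shorted switch itself can carry the signal for free. This is a minimal, untimed illustration of the constraint; the dissertation's router is timing-driven and far more scalable, and all names here are hypothetical.

```python
from collections import deque, defaultdict

class DefectAwareRouter:
    """Toy model: wires are nodes, programmable switches are edges, and a
    stuck-closed defect permanently connects two wires (a forced edge)."""

    def __init__(self, switches, stuck_closed):
        self.adj = defaultdict(set)        # usable programmable switches
        for a, b in switches:
            self.adj[a].add(b)
            self.adj[b].add(a)
        self.shorts = defaultdict(set)     # stuck-closed defects (forced)
        for a, b in stuck_closed:
            self.shorts[a].add(b)
            self.shorts[b].add(a)
        self.owner = {}                    # wire -> net currently occupying it

    def _short_closure(self, wire):
        """All wires electrically tied to `wire` via stuck-closed switches."""
        seen, stack = {wire}, [wire]
        while stack:
            w = stack.pop()
            for s in self.shorts[w]:
                if s not in seen:
                    seen.add(s)
                    stack.append(s)
        return seen

    def route(self, net, src, dst):
        """BFS from src to dst; a wire is usable only if its entire short
        closure is free or already owned by this net. Shorts are traversed
        as free edges -- the 'utilization' of the defect."""
        parent = {src: None}
        q = deque([src])
        while q:
            w = q.popleft()
            if w == dst:
                break
            for nxt in self.adj[w] | self.shorts[w]:
                if nxt in parent:
                    continue
                closure = self._short_closure(nxt)
                if all(self.owner.get(c, net) == net for c in closure):
                    parent[nxt] = w
                    q.append(nxt)
        if dst not in parent:
            return None                    # unroutable under the constraints
        path, w = [], dst
        while w is not None:               # walk back and claim the path,
            path.append(w)                 # plus everything shorted to it
            w = parent[w]
        for w in path:
            for c in self._short_closure(w):
                self.owner[c] = net
        return path[::-1]
```

Treating the shorted pair as an occupied group (rather than discarding both wires) is what recovers the chip area that conventional defect avoidance would waste.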

Chip-level optimization driven by accelerator-centric architectures. A chip can also be customized to an application domain by integrating a sea of accelerators designed for the domain's frequently used kernels. Designing the interconnects among customized accelerators and shared resources (e.g., shared memories) is a serious challenge: accelerators run roughly 100x faster than CPUs and place a heavy data demand on the communication infrastructure. To address this challenge, we develop a novel design for the interconnects between accelerators and shared memories and exploit several optimization opportunities that emerge in accelerator-rich computing platforms. Experiments show that our design outperforms prior work optimized for CPU cores or signal routing. Another design challenge lies in data reuse optimization within an accelerator, which minimizes its off-chip accesses and on-chip buffer usage. Because a fully pipelined computation kernel consumes large amounts of data every clock cycle, and the data access pattern is the major difference among applications, existing accelerators use ad hoc data reuse schemes carefully tuned per application. To reduce the engineering cost of accelerator-rich architectures, we develop a data reuse infrastructure that is generalized for the stencil computation domain and can be instantiated to the optimal design for any application in the domain. We demonstrate the robustness of our method on a set of real-life benchmarks.
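As a deliberately tiny illustration of the kind of reuse the stencil infrastructure generalizes, the sketch below implements a software line buffer for a 3x3 stencil: each input pixel is read from "off-chip" exactly once, while the full window is served from an on-chip buffer holding only two rows plus three pixels. All names are hypothetical, and the real infrastructure is parameterized by stencil shape and instantiates hardware, not Python.

```python
from collections import deque

def run_stencil(stream, W, f):
    """Stream pixels row-major over an image of width W; emit f(window)
    for each valid 3x3 window. Each pixel is read exactly once; the
    window is served from a reuse buffer of 2 rows + 3 pixels."""
    buf = deque(maxlen=2 * W + 3)        # on-chip reuse buffer
    out = []
    for i, px in enumerate(stream):      # one off-chip read per cycle
        buf.append(px)
        r, c = divmod(i, W)
        if r >= 2 and c >= 2:            # rows r-2..r, cols c-2..c are in buf
            win = [buf[k * W + j] for k in range(3) for j in range(3)]
            out.append(f(win))
    return out

# Example: 3x3 mean filter over a 4x6 image -- 24 off-chip reads total,
# versus 9 reads per output pixel without reuse.
W, H = 6, 4
img = list(range(W * H))
blurred = run_stencil(img, W, lambda w: sum(w) / 9)
```

The buffer size (2W + 3) rather than the full image is what bounds the on-chip storage, and generalizing the window indexing over stencil shape and image width is what lets one infrastructure cover the whole domain.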

Server-level and cluster-level optimization driven by big data. In the era of big data, workloads no longer fit on a single chip. Most data reside on disks, and only a small portion can be loaded into main memory during computation. Because disk access is slow, the primary design goal becomes minimizing data transfer between disks and main memory. We select a popular big data application, the convolutional neural network (CNN), as a case study: we analyze its linear algebraic properties and propose algorithmic modifications that reduce both the total computational workload and the disk accesses. Furthermore, when the application data grow even larger, they must be distributed across a cluster of server nodes, which motivates us to develop an accelerator-centric computing cluster. We run two machine learning applications, logistic regression and artificial neural networks (ANNs), on our prototype cluster and minimize the total data transfer incurred during computation. We select distributed stochastic gradient descent (dSGD) as the training algorithm to eliminate inter-node communication within a training iteration, and we deploy an in-memory cluster computing infrastructure, Spark, to eliminate inter-node communication across training iterations. Baseline Spark supports only CPUs, so we develop a software layer that allows Spark tasks to offload their major computation to the accelerators attached to each server node. During offloading, we group multiple tasks into a batch and transfer them to the target accelerator in one transaction, minimizing the setup overhead of data transfers between accelerators and host servers. We further implement accelerator data caching, exploiting the iterative nature of machine learning applications to eliminate unnecessary transfers of training data.
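The cluster-level flow can be sketched in PySpark as follows. This is a minimal illustration under simplifying assumptions: `accelerator_gradient` is a hypothetical stand-in for the offload layer (the real system ships a partition's batched records to the node's accelerator in a single transaction; here the "accelerator" is just NumPy), the objective is logistic regression on synthetic data, and `cache()` stands in for keeping the training data resident across iterations.

```python
import numpy as np
from pyspark import SparkContext

def accelerator_gradient(w, X, y):
    """Hypothetical offload target: logistic-regression gradient on a batch."""
    preds = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (preds - y)

def partition_gradient(w_bc):
    def compute(records):
        rows = list(records)                 # batch the whole partition ...
        if rows:
            X = np.array([x for x, _ in rows])
            y = np.array([t for _, t in rows])
            yield accelerator_gradient(w_bc.value, X, y)  # ... one offload call
    return compute

if __name__ == "__main__":
    sc = SparkContext(appName="dSGD-sketch")
    rng = np.random.default_rng(0)
    n, d, iters, lr = 10_000, 8, 20, 0.5
    X = rng.normal(size=(n, d))
    y = (X @ np.ones(d) > 0).astype(float)
    data = sc.parallelize(list(zip(X, y)), numSlices=4).cache()  # reused each iter
    w = np.zeros(d)
    for _ in range(iters):
        w_bc = sc.broadcast(w)               # only model state moves per iteration
        grad = data.mapPartitions(partition_gradient(w_bc)) \
                   .reduce(lambda a, b: a + b)
        w -= lr * grad / n
    sc.stop()
```

The structure mirrors the points above: dSGD means each partition computes its gradient independently within an iteration, the cached RDD avoids re-reading training data across iterations, and batching a partition into one `accelerator_gradient` call amortizes the per-transfer setup cost.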
