Resilient 3D Network-on-Chip Design and Analysis
- Yaghini, Pooria M.
- Advisor(s): Bagherzadeh, Nader
Abstract
Like every other major change in computer architecture, exascale computing, targeted
for 2020, requires dramatic and unanticipated shifts from several perspectives.
The biggest challenge facing this trend is to design an exascale system with a hundredfold
reduction in the estimated power cost of more than $2.5B per year for a system
designed with current technology. It has been reported that a large portion of total
power is consumed by communication through the interconnection network. Communication
between the computational components of System-on-Chip (SoC) designs can
account for more than 25 percent of the energy dissipation of the whole system. Network-on-Chip (NoC)
is recognized by many researchers as the best communication infrastructure for many-core
systems. To lower communication power, researchers have proposed
designing thinned and stacked 3D ICs. 3D ICs, fabricated using Through-Silicon
Vias (TSVs), offer higher bandwidths, smaller form factors, shorter wire lengths, lower
power, and better performance than traditional 2D ICs. The combination of 3D structures
and NoC is the most promising approach for obtaining the projected performance
and power requirements for exascale systems. Besides the extremely constrained power budget, achieving an acceptable level of resiliency for 1,000,000 cores in an exascale
system is a crucial challenge. Communication reliability, due to the huge amount of
data movement in these systems, plays a key role.
In this dissertation, the focus is to identify, characterize, and mitigate the reliability
threats of TSV-based 3D communication structures, specifically threats introduced by
TSV-to-TSV coupling faults.
In the first step, the identification of the reliability threats, the potential physical
faults of a baseline TSV-based 3D NoC architecture are classified by targeting two-dimensional
(2D) NoC components and their inter-die connections. Subsequently, TSV
issues, thermal concerns, and Single Event Effects (SEEs) are investigated and categorized
in order to propose evaluation metrics for inspecting the resiliency of 3D NoC
designs.
Then, in the second step, having overviewed the common TSV issues, a framework
is proposed for quantifying the 3D NoC reliability using formal methods. TSV issues
are modeled as a time-invariant failure probability and a reliability criterion for TSV-based
NoC is defined. The relationship between NoC reliability and TSV failure is
quantified. For the first time, the reliability criterion is reduced to a tractable closed-form
expression that requires a single Monte Carlo simulation.
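
As a rough illustration of how a time-invariant TSV failure probability can be related to link-level reliability, the following Python sketch compares a binomial closed form with a Monte Carlo estimate. It is not the dissertation's framework: the independence of TSV failures, the notion of spare TSVs, and all function and parameter names are assumptions made only for this example.

```python
import math
import random


def link_reliability_closed_form(n_tsv, p_fail, spares=0):
    """Probability that at most `spares` of `n_tsv` TSVs fail.

    Assumes (hypothetically) that each TSV fails independently with the
    same time-invariant probability `p_fail`.
    """
    return sum(
        math.comb(n_tsv, k) * p_fail**k * (1 - p_fail) ** (n_tsv - k)
        for k in range(spares + 1)
    )


def link_reliability_monte_carlo(n_tsv, p_fail, spares=0, trials=20_000, seed=0):
    """Estimate the same probability by sampling TSV failures."""
    rng = random.Random(seed)
    working = 0
    for _ in range(trials):
        # Count how many TSVs fail in this sampled link.
        failed = sum(rng.random() < p_fail for _ in range(n_tsv))
        if failed <= spares:
            working += 1
    return working / trials


if __name__ == "__main__":
    exact = link_reliability_closed_form(n_tsv=8, p_fail=0.01)
    estimate = link_reliability_monte_carlo(n_tsv=8, p_fail=0.01)
    print(f"closed form: {exact:.4f}, Monte Carlo: {estimate:.4f}")
```

In this simplified setting the closed form makes sampling unnecessary; the Monte Carlo estimate is included only as a cross-check on the formula.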
In the third step, a system-level TSV coupling fault model is proposed, which models
the capacitive coupling effect, considering thermal impact, at circuit-level accuracy.
This model can be plugged into any system-level and RTL-level TSV-based 3D-IC
data-oriented simulator. The TSV coupling effects are first analyzed and recorded at
circuit level, and are then applied to the TSVs dynamically in system-level simulations at run time through precise monitoring and calculation. The
proposed fault model is potentially useful for evaluating the reliability of 3D many-core
applications in which TSV coupling may lead to failure.
After setting up the TSV coupling fault modeling framework, multiple coding approaches
are proposed to prevent coupling fault occurrence on TSV links. In these
approaches, the coupling fault effect is addressed by diagnosing the hazardous current
flow direction patterns of the TSV bus, and encoding the data bits to avoid those patterns
at run-time. Different coding schemes are devised to address both types of TSV
coupling, inductive and capacitive. These approaches are devised to be low overhead,
fast, and highly efficient. Empirical simulations are performed with both random and
realistic benchmarks, including PARSEC, to demonstrate the efficacy of the devised
approaches. All of these approaches are also implemented in hardware to obtain a
realistic estimate of the overheads imposed at the logic level. Experimental results show
that these approaches significantly improve communication reliability over TSV links,
with no extra TSVs and negligible information redundancy or hardware logic
overhead.
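
The general idea of encoding data to avoid hazardous switching patterns can be illustrated with a generic invert-style code. The Python sketch below is not one of the dissertation's coding schemes. It assumes a one-dimensional row of wires rather than a TSV array, treats "adjacent wires switching in opposite directions" as the hazardous pattern, and uses a flag bit that would require an extra wire, unlike the approaches described above. All names are hypothetical.

```python
def transitions(prev, curr):
    """Per-wire transition: +1 rising, -1 falling, 0 unchanged."""
    return [c - p for p, c in zip(prev, curr)]


def hazard_count(prev, curr):
    """Number of adjacent wire pairs that switch in opposite directions."""
    t = transitions(prev, curr)
    return sum(1 for a, b in zip(t, t[1:]) if a * b == -1)


def encode(prev_sent, data):
    """Send `data` or its complement, whichever causes fewer hazards.

    Returns (word, flag); flag is 1 when the complement was sent.
    """
    inverted = [1 - b for b in data]
    if hazard_count(prev_sent, inverted) < hazard_count(prev_sent, data):
        return inverted, 1
    return list(data), 0


def decode(word, flag):
    """Recover the original data bits from the sent word and flag."""
    return [1 - b for b in word] if flag else list(word)


if __name__ == "__main__":
    prev_sent = [0, 1, 0, 1]
    data = [1, 0, 1, 0]
    word, flag = encode(prev_sent, data)
    print(word, flag, decode(word, flag))
```

Here every adjacent pair would switch in opposite directions if the data were sent unmodified, so the encoder sends the complement and sets the flag, and the decoder restores the original bits.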
Overall, this work provides a rich set of TSV coupling-avoidance techniques, along with an
accurate and fast TSV coupling fault modeling and simulation framework, for the efficient and
effective design of reliable 3D communication architectures. It helps DFT designers
design robust TSV links more easily.