### UC Santa Cruz UC Santa Cruz Previously Published Works

**Title** HCDN: Hybrid-Mode Clock Distribution Networks

Permalink https://escholarship.org/uc/item/4g45d9dn

**Authors** Islam, Riadul Guthaus, Matthew R.

Publication Date 2018-08-01

Peer reviewed

# HCDN: Hybrid-Mode Clock Distribution Networks

Riadul Islam, Member, IEEE, and Matthew R. Guthaus, Senior Member, IEEE

Abstract—We propose a new hybrid clock distribution scheme that uses global current-mode (CM) and local voltage-mode (VM) clocking to distribute a high-performance clock signal with reduced power consumption. In order to enable hybrid clocking, we propose two new current-to-voltage converters. The converters are simple current receiver circuits based on amplifier and current-mirror circuits. The global clocking is bufferless and relies on current rather than voltage, which reduces the jitter. The local VM network improves compatibility with traditional CMOS logic. The hybrid clock distribution network exhibits 29% lower average power and 54% lower jitter-induced skew in a symmetric network compared to traditional VM clocks. To use hybrid clocking efficiently, we present a methodology to identify the optimal cluster size and the number of required receiver circuits, which we demonstrate using the ISPD 2009, ISPD 2010, and ISCAS89 testbench networks. At 1-2GHz clock frequency, the proposed methodology uses up to 45% and 42% lower power compared to a synthesized buffered VM scheme using the ISPD 2009 and ISPD 2010 testbenches, respectively. In addition, the proposed hybrid clocking scheme saves up to 50% and 59% of power compared to a buffered scheme using the ISCAS89 benchmark circuit at 1GHz and 2GHz clock frequency, respectively.

Index Terms—Clock synthesis, jitter, low-power, clock distribution network.

#### I. INTRODUCTION

Due to the continuous scaling of CMOS technology, digital integrated circuits (ICs) have leveraged corresponding technology improvements to achieve lower power, lower area, and higher performance/speed. However, the bounds of instruction-level parallelism (ILP), integration of disparate functionality, and process variations limit overall system-on-chip (SOC) performance. This is primarily due to the reciprocal relationships between power, performance, and reliability. While a synchronous digital IC clock distribution network (CDN) consumes a significant amount of power, designing a reliable CDN ensures correct functionality of the network and ultimately determines yield.

In order to improve yield, designers spend significant time to meet setup and hold-time constraints. A direct source of these timing uncertainties is process variation-induced skew and jitter. In addition, at high-frequency operation, only a limited portion of the clock period is delegated to timing uncertainty. As a result, skew and jitter are a primary hurdle for today's processor speed limits in conjunction with the total power budget.

Timing jitter is the uncertainty of the rising/falling edges of the circuit clock and the width/duration of the duty cycle. It has to be measured accurately in order to avoid operating malfunction. However, off-chip jitter measurement requires an on-chip high-performance driver to deliver the distortionless clock signal to an external instrument through a high-frequency pin [1], [2]. In addition, ensuring the reliability of the delivered signal becomes increasingly difficult due to the presence of noise in high-frequency phase-locked loops (PLLs). In contrast, on-chip jitter measurement requires extra circuitry, primarily consisting of a cascaded time difference amplifier [3] or ring oscillators [4], without increasing pin counts. As a result, it adds fewer parasitics and improves efficiency compared to the off-chip method. However, these schemes for precisely measuring subpicosecond jitter increase design complexity as well as the overall power and area of the network. Hence, in a high-performance processor, it is desirable to have a low-jitter clock with low power consumption.

Current-mode (CM) signals are more robust to variability and consume less power than their voltage-mode (VM) counterparts [5]–[9], [14]. However, most CM signaling is restricted to off-chip one-to-one signal transmission [6], [8]. Only in recent years have researchers started using CM signaling in CDNs to achieve higher reliability and lower power compared to traditional buffered VM clocking [5], [7], [10]–[12]. While most of these works addressed clock reliability issues, considering process variation, transistor mismatch, supply-voltage fluctuation, and crosstalk-induced skew, they focused mainly on skew and did not consider clock jitter.

In this paper, we present the first hybrid clocking scheme that integrates CM clocking into the global CDN to reduce the jitter-induced skew and other clock uncertainties, while the local clock retains VM compatibility with the low-power CMOS logic in the rest of the chip. Specifically, the key contributions of this paper are:

- Two new current-to-voltage converter circuits
- The first hybrid (global CM and local VM) clocking methodology to create symmetric and non-symmetric clocks
- The first demonstration of jitter improvement using partial CM clocking
- The effective integration of CM clocking and VM clocking to save power

The rest of the paper is organized as follows: Section II gives a brief overview of some existing CM clocking and signaling schemes. Section III proposes our hybrid CDN. Section IV compares our hybrid CDN with existing schemes. Finally, Section V concludes the paper.

R. Islam is with the Department of Electrical and Computer Engineering, University of Michigan-Dearborn, MI 48128 USA (e-mail: riaduli@umich.edu).

M. Guthaus is with the Department of Computer Engineering, University of California Santa Cruz, CA 95064 USA (e-mail: mrg@ucsc.edu).

Copyright (c) 2018 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org.



Fig. 1: The previous sense-amplifier-based current-sensing circuit suffers from interfacing issues due to the metastability of the two-output nodes and consumes a significant amount of static power [13].

#### II. OVERVIEW OF EXISTING CURRENT-MODE SIGNALING AND CLOCKING SCHEMES

In general, signaling refers to one-to-one signal transmission, while clocking refers to one-to-many signal transmission.

#### A. Existing Current-Mode Signaling Schemes

In earlier years, CM signaling was realized using a differential sense-amplifier-based design [13]. The primary reason was to improve the robustness of the design by eliminating common-mode noise. Unlike traditional repeater-based designs, this scheme reduces signaling power by sending differential current over the bufferless interconnect. Figure 1 shows the differential Rx circuit that enables CM operation by using two always-"ON" NMOS transistors (M1-M2). When the equalizing signal is deasserted, it breaks the metastability, and the sense amplifier compares two complementary currents to produce a full-swing output and complementary output signals. However, at equal phase, the Rx circuit outputs could enter a metastable state, which creates an interface issue for this design. Another research study resolves the metastability issue by adding an extra sense amplifier in the output stage and significantly reduces static power [15].

#### B. Existing Current-Mode Clocking Schemes

The basic building blocks of a CM clocking scheme are a current transmitter (Tx) and a current receiver (Rx) circuit. The Tx accepts a VM signal from a PLL or a frequency divider and converts it into a CM signal that it distributes to the entire CDN. In contrast, the Rx circuit accepts the input current and converts it into a VM signal for the downstream CMOS logic. A variation-tolerant CM scheme uses corner-aware bias circuitry for the CM Tx along with an inverter amplifier Rx circuit that provides low impedance to the ground and holds the Rx terminal point at the switching threshold [6]. This scheme exhibits significant power-performance improvement compared to the VM scheme; however, its application is limited to one-to-one signal transmission.

Fig. 2: Previous CM schemes used an expensive current-pulsed FF applicable only to symmetric networks, which restrained the application of CM clocking in large-scale design due to the area overhead [7].

Recently, other researchers have proposed a CM clocking scheme for point-to-many clock networks, shown in Figure 2, which demonstrates significant power and performance improvement over traditional VM clock schemes [7]. This scheme is based on a low-power CM flip-flop (CM FF) as Rx and utilizes a NAND-NOR Tx circuit that sends a current pulse converted from a single-source VM signal. The Tx generates and transmits the current pulse, which is synchronized with the rising edge of the input VM clock signal to enable edge-triggered operation of the Rx circuit in CM FFs. In addition to low power, this scheme shows significant noise robustness compared to existing VM clocking schemes. However, all these CM schemes neglect jitter-induced skew in the CDN. CM schemes are expected to be more robust to jitter than VM schemes due to the absence of buffers in the CDN, but this has not been shown.

## III. PROPOSED HYBRID CLOCK DISTRIBUTION NETWORK (HCDN)

One-to-one CM signaling schemes perform current-to-voltage conversion in the top-level clock network and do not consider the local clock pins, which may require buffers to drive the final VM FFs [6]. In contrast, the proposed hybrid clocking scheme utilizes global CM clocking to increase the robustness of the CDN against noise and jitter, and the final VM FFs are driven by buffers. The proposed hybrid scheme uses current Rx circuits to convert current to voltage and generate a full-swing output voltage for the buffers. For a fair comparison in Section IV, the final analysis includes the power of the CM Tx, the CM Rx circuits, and the buffers.

#### A. Receiver (Current-to-Voltage Converter) Circuit Design

In order to enable hybrid clocking, we propose two high-performance current-to-voltage converter circuits. These circuits act as an interface between CM and VM clocking.

1) Simple Receiver Circuit Design: The core element in the hybrid clocking scheme is an Rx circuit, shown in Figure 3(a). The simple Rx (SRx) circuit consists of a diode-connected inverter, an active-load common-source (CS) amplifier (M1–M2), an inverter amplifier (A1), and an output inverter (X1). The diode-connected inverter (Mr1–Mr2) provides a low-impedance input node for the CM clocking. The Rx can efficiently produce a 50% duty-cycle output voltage (CLK) utilizing a 50% duty-cycle input current ( $I_{in}$ ) injected at node A.

The operation of the proposed Rx can be explained using Figure 3(a) and Figure 3(b). The current signal $(I_{in})$ from the clock network is fed to the low-impedance node A. The direction of $I_{in}$ determines the gate-to-source voltage $(V_{GS})$ of M1. When the driver (Tx) acts as a current source, the voltage at node A drops with the rising edge of $I_{in}$, which lowers the $V_{GS}$ of M1. As a result, I2 reduces, and the constant current load (M2) increases the node B voltage. On the other hand, when the Tx acts as a current sink, the voltage at node A increases with the falling edge of $I_{in}$, which increases the $V_{GS}$ of M1. As a consequence, I2 increases and the node B voltage drops by discharging the load capacitance. The A1 amplifier amplifies this voltage swing to the CMOS-logic level. The output inverter X1 helps the Rx circuit drive the output load or the clock buffer.

The active-low enable/reset ($\overline{EN}/\overline{RS}$) signal performs two functions. The most critical is saving static power: the $\overline{EN}$ signal decouples $V_{dd}$ using PMOS M3. In active mode, the Rx circuit has a 19.4 $\mu$A leakage current, which is significantly higher than the inactive-mode leakage current of 4.9 $\mu$A. In addition, NMOS M4 pulls down node B to prevent unintentional output voltage swings due to noise.

2) Current-Mirror-Based Receiver Circuit Design: The current-mirror-based low-power Rx (MRx) is shown in Figure 4(a). The MRx circuit uses two reference voltage generators (Mr1–Mr4), a current-comparator (CC), inverter amplifiers  $(A_1-A_2)$ , and buffers  $(X_1-X_2)$  to generate the full-swing output voltage. It can efficiently produce a 50% duty-cycle output voltage (CLK), utilizing a 50% duty-cycle input current  $(I_{in})$  injected at node A.

In the CC, transistor M2 mirrors the reference current  $(I_{ref1})$ , and transistor M3 mirrors the combination of an input current and another reference current  $(I_{ref2} + I_{in})$ . The amplifier  $(A_1)$  compares the mirrored currents at node C, while amplifier (A2) brings the voltage to a CMOS logic level at node E. In addition, the buffers  $(X_1-X_2)$  are strong enough to drive the local clock capacitance.

Similar to the SRx circuit, the MRx circuit also uses an active-low $\overline{EN}/\overline{RS}$ signal to perform two functions. In active mode, the MRx circuit has a 36.4 $\mu$A leakage current, which is significantly higher than the inactive-mode leakage current of 1.99 $\mu$A.

(a) Unlike the previously reported current-pulsed FF [11], the proposed hybrid clocking uses a current-to-voltage converter circuit that we refer to as simple Rx (SRx) to generate a 50% duty cycle voltage pulse to drive the downstream buffered network.

(b) Simulation waveforms confirm the current-to-voltage (50% duty cycle) CLK generation.

Fig. 3: Proposed SRx and simulation results.

Clock gating, in which clock switching to sequential elements is restrained to reduce unnecessary dynamic power, is considered one of the most widely adopted low-power techniques. Clock gating can range from fine-grained to coarse-grained, disabling a small group of registers, a cluster of registers in a module, or an entire functional unit [16]. The Rx circuit reset pin $(\overline{RS})$ can easily shut off certain portions of the CDN to perform clock gating. In addition, using $\overline{RS}$ in a hybrid clock network reduces the possibility of entering a metastable condition and increases the robustness of the design.

#### B. Hybrid Clock Distribution and Transmitter Circuit Design

(a) Unlike the previously reported current-pulsed FF [11], the proposed hybrid clocking uses a current-mirror-based Rx (MRx) as the current-to-voltage converter to generate a 50% duty cycle voltage pulse to drive the downstream buffered network.

(b) Simulation waveforms confirm the expected current-to-voltage (50% duty cycle) CLK generation.

Fig. 4: Proposed MRx circuit and simulation results.

The proposed hybrid (global current-mode and local voltage-mode) CDN (HCDN) scheme is shown in Figure 5. It has a pulsed-current Tx at the root of the clock, a bufferless global network, an Rx circuit as introduced in Section III-A, and a buffered local clock network to drive sequential elements. The Tx receives a conventional VM clock from a PLL/frequency divider at the root of the clock tree and supplies a pulsed current to the interconnect. The clock network is held at a nearly constant voltage. Since a symmetric clock tree has equal impedances in each branch, the current is distributed equally to each Rx circuit. In addition, the local clock network in high-performance microprocessors often uses a clock grid to tackle the skew issue. In that case, we can easily employ one Rx circuit for each sector network by dividing the whole network into uniform sector clocks [17].
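The equal-distribution argument can be illustrated with a lumped current divider. The following is only a minimal sketch, assuming a single-level, purely resistive split (wire capacitance and the Rx input dynamics are ignored); the function name `split_current` and the admittance values are hypothetical, while the 4.5 $\mu$A per-Rx current is the nominal $I_{in}$ quoted in Section IV-B.

```python
def split_current(i_total, branch_admittances):
    """Divide i_total among parallel branches in proportion to their admittance."""
    y_sum = sum(branch_admittances)
    return [i_total * y / y_sum for y in branch_admittances]

I_RX_NOMINAL = 4.5e-6  # A, nominal Rx input current from the Monte Carlo setup

# Symmetric tree: four branches with identical (hypothetical, normalized) admittance.
symmetric = split_current(4 * I_RX_NOMINAL, [1.0, 1.0, 1.0, 1.0])

# Asymmetric tree: one branch with twice the admittance draws a larger share.
skewed = split_current(4 * I_RX_NOMINAL, [1.0, 1.0, 1.0, 2.0])

print([f"{i * 1e6:.2f} uA" for i in symmetric])
print([f"{i * 1e6:.2f} uA" for i in skewed])
```

In the symmetric case every Rx sees the same input current, which is why the global tree is built as an equal-impedance network; in the asymmetric case the shares differ and the Rx inputs would no longer be matched.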

Unlike previous CM clocking schemes [6], [7], the hybrid scheme requires a simple current Tx as shown in Figure 5. The current Tx is a weak 4-transistor driver that drives the CM bufferless global network. The auxiliary PMOS (M1) and NMOS (M4) limit the peak current and reduce the interconnect voltage swing. However, we need to properly size the Tx (M1–M4) to ensure that it can provide an alternating pulsed current into the clock network and distribute the required amount of current to each Rx circuit. The buffers in the Tx are strong enough to drive the M2–M3 transistor pair.

Fig. 5: The proposed hybrid clocking scheme utilizes a single buffered voltage-to-current converter that drives the bufferless global network and distributes an equal amount of current to each current-to-voltage converter Rx circuit, while the Rx circuit and the buffers drive the local network and final registers.

In the Tx circuit, M1 and M4 are in saturation mode due to the gate-drain connections. On the falling edge of the input clock signal, M2 is "ON" and M3 is "OFF," and the Tx sends a "push" current. On the other hand, when the clock signal is high, M3 is "ON" and M2 is "OFF," and the Tx sinks current from the network, resulting in a "pull" current. The sizing of M2–M3 determines the near-constant voltage of the interconnect and sets the required bias voltage of the Rx circuit.

The proposed hybrid clocking scheme uses a single Tx driver at the root of the global CM clock network, and the critical root wire carries the total current that is distributed to all the branches. Hence, the sizing of the global clock network wires that determine the wire resistance is critical for both reliability and performance and to ensure undistorted input current for each Rx. At the same time, determination of wire width must also consider electromigration effects while carrying the total current at the root.

The Rx circuit receives the alternating current from the Tx and converts it into a full-swing voltage CLK signal as shown in Figure 3(b). The local VM clock network is buffered and optimized for an output CLK signal with a slew of less than 10% of the clock period, which is considered to be the typical slew-rate bound in a high-performance clock network design. The primary reason is to reduce susceptibility to variation, reduce the effect of clock slew on setup/hold constraints, and decrease short-circuit power consumption [18].

#### C. Hybrid Clock Distribution Methodology

Existing clock synthesis tools work only with VM clock signals. A design methodology for hybrid clock networks can utilize existing VM algorithms in the local networks, but the clustering and routing of CM clocks to form a hybrid network can improve the global clock network robustness and power. However, it is necessary to identify the pivotal steps of a hybrid CDN generation scheme to measure the design complexity and compatibility of the proposed technique compared to the existing VM techniques. The proposed hybrid CDN generation methodology is shown in Figure 6. It takes any given network as input and generates a global CM and local buffered VM CDN, i.e., a hybrid CDN.

Fig. 6: For any given network, the proposed hybrid clocking scheme creates a local VM buffered clock network and a global CM bufferless network and finally combines the CM and VM clock networks to implement HCDN.

First of all, the hybrid methodology clusters the given network based on the sinks' Cartesian coordinates. For this, we use a k-means clustering algorithm. The basic idea is that for a given number of sinks  $(xs_1, xs_2, ..., xs_n)$ , the algorithm tries to partition the *n* sinks into  $k \leq n$  sets/clusters  $S = \{S_1, S_2, ..., S_k\}$  and tries to minimize the within-cluster sum of squares. This can be mathematically expressed as

$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{xs_j \in S_i} \|xs_j - \mu_i\|^2 \tag{1}$$

where  $\mu_i$  is the mean of the points within the cluster  $S_i$ . The k-means clustering algorithm identifies the centroids of each cluster. Then the methodology divides the flow into two parallel paths, as shown in Figure 6. The left path generates the global CM network, while the right path generates the local buffered VM network. After constructing both the CM and VM networks, we combine them by connecting the roots of each cluster to the outputs of the corresponding CM Rx circuit to build the proposed hybrid CDN.
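The clustering step can be sketched with Lloyd's algorithm, the standard heuristic for the k-means objective in (1). This is an illustrative sketch only, not the authors' implementation: the function names, the random initialization, and the 16 sink coordinates below are all assumptions made for the example.

```python
import random

def kmeans(sinks, k, iters=100, seed=0):
    """Lloyd's algorithm on 2-D sink coordinates; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(sinks, k)  # start from k distinct sink locations
    labels = [0] * len(sinks)
    for _ in range(iters):
        # Assignment step: each sink joins the cluster of its nearest centroid.
        labels = [
            min(range(k), key=lambda i: (x - centroids[i][0]) ** 2
                                        + (y - centroids[i][1]) ** 2)
            for x, y in sinks
        ]
        # Update step: move each centroid to the mean of its member sinks.
        new_centroids = []
        for i in range(k):
            members = [s for s, lab in zip(sinks, labels) if lab == i]
            if members:
                new_centroids.append((sum(x for x, _ in members) / len(members),
                                      sum(y for _, y in members) / len(members)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster in place
        converged = new_centroids == centroids
        centroids = new_centroids
        if converged:
            break
    return centroids, labels

def wcss(sinks, centroids, labels):
    """Within-cluster sum of squares, i.e. the objective of Eq. (1)."""
    return sum((x - centroids[lab][0]) ** 2 + (y - centroids[lab][1]) ** 2
               for (x, y), lab in zip(sinks, labels))

# Hypothetical 16-sink layout (coordinates in mm): four groups of four sinks.
sinks = [(cx + dx, cy + dy)
         for cx, cy in [(0.2, 0.2), (0.2, 0.8), (0.8, 0.2), (0.8, 0.8)]
         for dx, dy in [(-0.05, -0.05), (-0.05, 0.05), (0.05, -0.05), (0.05, 0.05)]]

centroids, labels = kmeans(sinks, k=4)
print("Rx candidate locations:", centroids)
print("WCSS:", wcss(sinks, centroids, labels))
```

Each returned centroid is a candidate Rx location, and the label of a sink identifies the local VM tree it belongs to. Lloyd's algorithm converges to a local minimum that depends on the initial centroids, so in practice several seeds or values of `k` would be tried.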

Algorithm 1 summarizes our proposed HCDN generation methodology. As input, the algorithm takes a clock tree (Tree) and the slew constraint $(S_L)$. The output of the algorithm is an HCDN. The algorithm starts with a clustering step that divides the given network into a global tree and a local tree in Line 4. Similar to timing-model-independent buffered clock-tree synthesis (BCTS) [19], the local tree uses the common connection length to each cluster to build an equal-height tree (EHT) and then buffers the local VM network to meet the slew constraint in Line 6–Line 7. As a result, the local tree has a common insertion delay. The global tree is an equal-impedance CM network that assigns equal height to each level using the centroids of the clusters (Line 9). Then the algorithm places an Rx circuit at each centroid, computes the total admittance of the network for initial Tx sizing [20], and runs the transient simulation to extract the initial skew in Line 10–Line 14. For the total admittance computation, we consider the input admittances of the global CM network and the Rx circuits. Then the algorithm recursively sizes up or down from the initial Tx sizing ($T_{init}$) to find the lowest skew in Line 15–Line 22. Similar to any CMOS sizing, the Tx sizing problem is convex, and for the increment or decrement we use a sizing step $\delta s = 1\%$ of $T_{init}$. Then we build our proposed HCDN by connecting the roots of each cluster to the outputs of the corresponding CM Rx circuit and extract the power-performance of the network in Line 23.

The major advantage of the proposed hybrid CDN generation methodology is that it can be applicable to both symmetric and asymmetric networks.

| Algorithm 1 | HCDN generation methodology |
|-----|-------------------------------------------------------------------------------------------------|
| 1:  | **Input:** clock tree (Tree), slew-constraint $(S_L)$; |
| 2:  | **Output:** Hybrid CDN; |
| 3:  | |
| 4:  | (GloTree(GTI), LocTree(LT)) = Clustering(Tree) |
| 5:  | //Local buffered VM network |
| 6:  | $BCTS = EHT(LT)$ $\triangleright$ equal-height tree generation |
| 7:  | $BufferSizing(BCTS)$ $\triangleright$ buffer sizing to meet $S_L$ |
| 8:  | //Global CM network |
| 9:  | $GT = EHT(GTI)$ $\triangleright$ global CM tree generation |
| 10: | $RxPlacement(GT)$ |
| 11: | $Y_T^G = TotalAdmittance(GT)$ |
| 12: | $T_{init} = SizeTransmitter(Y_T^G)$ |
| 13: | $TransientSimulation()$ |
| 14: | $S_{init}^{G} = CalculateSkew()$ |
| 15: | $S_{new}^G = S_{best}^G = S_{init}^G,\ T_{best} = T_{newUp} = T_{newDown} = T_{init}$ |
| 16: | **while** $S_{new}^G \leq S_{best}^G$ **do** $\triangleright$ repeat if improvement or equal |
| 17: | Recursively size up $(T_{newUp} = T_{newUp} + \delta s)$ and extract $S_{best}^G$ and $T_{best}$ $\triangleright$ $\delta s$ is 1% of $T_{init}$, sizing up |
| 18: | **end while** |
| 19: | $S_{new}^G = S_{init}^G$ |
| 20: | **while** $S_{new}^G \leq S_{best}^G$ **do** $\triangleright$ repeat if improvement or equal |
| 21: | Recursively size down $(T_{newDown} = T_{newDown} - \delta s)$ and extract $S_{best}^G$ and $T_{best}$ $\triangleright$ sizing down |
| 22: | **end while** |
| 23: | HCDN = Combine(GT, LT) |
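The Tx sizing search in Line 15–Line 22 can be sketched as a one-dimensional search in both directions from $T_{init}$. The sketch below is a hedged illustration rather than the authors' code: `skew_of` stands in for the transient simulation plus skew extraction (Line 13–Line 14), the quadratic `toy_skew` model is purely hypothetical, and the `max_steps` guard is an added assumption to keep the loop finite on a flat skew curve.

```python
def size_transmitter(t_init, skew_of, step_frac=0.01, max_steps=1000):
    """Search up, then down, from t_init in steps of step_frac * t_init.

    skew_of(t) is a placeholder for simulating the global CM network with
    Tx size t and returning its skew. Returns (t_best, s_best).
    """
    step = step_frac * t_init            # delta_s = 1% of T_init by default
    t_best, s_best = t_init, skew_of(t_init)
    for direction in (+1, -1):           # size up first, then size down
        for n in range(1, max_steps + 1):
            t_new = t_init + direction * n * step
            if t_new <= 0:
                break                    # a non-positive size is not physical
            s_new = skew_of(t_new)
            if s_new > s_best:
                break                    # skew got worse: stop this direction
            t_best, s_best = t_new, s_new  # improvement or equal: keep going
    return t_best, s_best

def toy_skew(t):
    """Hypothetical convex skew model (ps) with its minimum at t = 1.07."""
    return (t - 1.07) ** 2 + 5.0

t_best, s_best = size_transmitter(1.0, toy_skew)
print(f"best Tx size: {t_best:.2f} x T_init, skew: {s_best:.4f} ps")
```

Because the sizing problem is stated to be convex, stopping at the first step that worsens the skew is sufficient; a non-convex skew curve would need a broader search.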

#### IV. EXPERIMENTS

#### A. Experimental Setup

The proposed hybrid clocking scheme is implemented in the FreePDK 45nm CMOS technology design kit [21]. The current Tx and Rx circuits' layouts are compatible with a standard cell library height of twelve horizontal metal-2 tracks. The performance of the circuits was evaluated using HSPICE simulation at clock frequencies from 1 to 2GHz and a 1V supply voltage. In addition, we bounded the clock slew at 10% of the clock period, which corresponds to 100ps and 50ps at 1GHz and 2GHz clock frequencies, respectively.
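The slew bounds follow directly from the clock period. A short sketch of the arithmetic (the helper name `slew_bound_ps` is hypothetical):

```python
def slew_bound_ps(freq_hz, fraction=0.10):
    """Slew bound in picoseconds as a fraction of the clock period."""
    period_ps = 1e12 / freq_hz
    return fraction * period_ps

bounds = {freq: slew_bound_ps(freq) for freq in (1e9, 2e9)}
for freq, bound in bounds.items():
    print(f"{freq / 1e9:.0f} GHz: period {1e12 / freq:.0f} ps, slew bound {bound:.0f} ps")
```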

#### B. Proposed Receiver Circuits Analysis

In order to measure the robustness of the proposed Rx circuits, we performed Monte Carlo simulations. In our Monte Carlo analysis, we introduced transistor threshold voltage variation using a Gaussian distribution function. The distribution function used 10% absolute variation with three standard deviations from the nominal value. In addition, we verified the robustness of the Rx circuit by considering 50% $I_{in}$ variation with three standard deviations from the nominal value (4.5 $\mu$A). Figure 7(a) and Figure 7(b) show the Monte Carlo simulations of the Rx under $I_{in}$ variation and under process variation, respectively. The proposed SRx has a 44.5ps mean ($\mu$) $I_{in}$-to-CLK delay with a 2.2ps standard deviation ($\sigma$) under process variation.

Fig. 7: We verify the robustness of the proposed SRx using Monte Carlo simulations under (a) $I_{in}$ variation and (b) process variation-induced threshold voltage variation.

Fig. 8: We verify the robustness of the proposed MRx using Monte Carlo simulations under (a) $I_{in}$ variation and (b) process variation-induced threshold voltage variation.

Similar to the SRx, we verified the robustness of the MRx circuit using Monte Carlo simulations under the same conditions. Figure 8(a) and Figure 8(b) show the Monte Carlo simulations of the Rx under  $I_{in}$  variation and under process variation, respectively. The proposed MRx has 71.6ps mean ( $\mu$ )  $I_{in}$ -to-CLK delay with 4.3ps standard deviation ( $\sigma$ ) under process variation. According to this analysis, the MRx has slightly higher variability ( $\frac{3\sigma}{\mu} = 0.18$ ) [22] compared to the SRx circuit, which has a variability of 0.15 under threshold voltage variation. In addition, under the same conditions, the MRx has  $1.6 \times$  more  $I_{in}$ -to-CLK delay compared to the SRx circuit, resulting in  $1.6 \times$  more global clock latency.
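The variability figures can be reproduced from the reported means and standard deviations. The snippet below is a small worked check using only the numbers quoted in the text; the helper name `variability` is an assumption.

```python
def variability(mean_ps, sigma_ps):
    """Variability metric 3*sigma/mu for an I_in-to-CLK delay distribution."""
    return 3.0 * sigma_ps / mean_ps

srx = variability(mean_ps=44.5, sigma_ps=2.2)   # SRx under process variation
mrx = variability(mean_ps=71.6, sigma_ps=4.3)   # MRx under process variation
delay_ratio = 71.6 / 44.5                       # MRx mean delay relative to SRx

print(f"SRx 3sigma/mu = {srx:.2f}")
print(f"MRx 3sigma/mu = {mrx:.2f}")
print(f"MRx/SRx delay = {delay_ratio:.1f}x")
```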

#### C. Proposed Hybrid Clocking Analysis Using H-Tree Networks

The total power consumption of the hybrid clocking scheme includes both the global CM CDN power and the local VM CDN power. The global network power includes the CM Tx power, the CDN interconnect power, and the Rx circuit power. The local VM CDN power includes the interconnect power and the buffer power. For this power consumption analysis, we have considered 2-level to 10-level H-tree networks with 16 to 1024 sinks, respectively. The sinks are evenly distributed over chips with edge lengths of 0.96mm to 7.69mm. To facilitate normal Rx operation, we used an active-low $\overline{EN}/\overline{RS}$ signal and also included the required routing power in the hybrid clocking power calculation.

In order to have a fair comparison, we used the same H-tree clock networks for both the hybrid and the traditional VM buffered clocking schemes. However, for the traditional VM scheme, the global CDN is driven by buffers instead of by the CM Tx.

Table I shows the total number of buffers (VM buffered system), total number of buffers and Rxs (hybrid system), and total power consumption ( $P_T$ ) of the traditional VM and proposed hybrid CDN simulation at clock frequencies ranging from 1 to 2GHz. Clearly, the proposed hybrid clocking scheme consumes less power than the synthesized traditional buffered clocking scheme for all sizes of CDN at different frequencies. This is primarily due to the large dynamic power consumption in the global VM CDN and the full voltage swing in the VM networks compared to the negligible voltage swing in the hybrid scheme global CDN.

The traditional VM scheme requires 44 to 1398 global and local buffers to drive different-sized CDNs, while hybrid clocking does not require buffers in the global CDN. As a result, hybrid clocking requires only 32 to 1024 local buffers to drive the local CDN. In addition, the hybrid scheme uses 4 to 256 Rx circuits for different hybrid networks. The proposed SRx-based hybrid system consumes up to 19% lower average power compared to a VM scheme with 16 to 1024 clock sinks at 1GHz clock frequency. For the same networks and clock frequency, the proposed MRx-based scheme consumes up to 20% lower average power compared to the VM buffered scheme. At 2GHz clock frequency, the proposed SRx-based hybrid system consumes up to 36% lower average power compared to a VM scheme with 16 to 1024 clock sinks. For the same networks and clock frequency, the proposed MRx-based scheme consumes up to 37% lower average power compared to the VM buffered scheme. Overall, the proposed SRx-based clocking scheme consumes 15% and 29% lower average power than the synthesized buffered VM scheme at 1GHz and 2GHz clock frequency, respectively. Using MRx, the proposed clocking scheme consumes 17.4% and 31% lower average power than the synthesized buffered VM scheme at 1GHz and 2GHz clock frequency, respectively.
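The summary figures can be cross-checked against the per-network savings reported in Table I. The sketch below assumes the overall numbers are equal-weight means over the four network sizes, which is an interpretation rather than something the text states explicitly.

```python
from statistics import mean

# "% of saving" columns of Table I, ordered by network size (16, 64, 256, 1024 sinks).
savings = {
    ("SRx", 1): [10.5, 13.8, 16.2, 18.6],
    ("SRx", 2): [21.7, 25.6, 32.9, 35.7],
    ("MRx", 1): [14.1, 16.3, 18.6, 20.4],
    ("MRx", 2): [24.6, 27.4, 34.9, 37.1],
}

# For each receiver type and frequency: (average saving, best-case saving).
summary = {key: (mean(vals), max(vals)) for key, vals in savings.items()}

for (rx, ghz), (avg, best) in summary.items():
    print(f"{rx} @ {ghz}GHz: average {avg:.2f}%, best {best:.1f}%")
```

Under that assumption the means come out close to the 15%, 29%, 17.4%, and 31% quoted above, and the best-case values correspond to the "up to" figures for the 1024-sink network.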

#### D. Effect of Temperature Variation

| Freq. (GHz) | # of sinks | Chip edge (mm) | VM: # of buffers | VM: $P_T$ (mW) | SRx hybrid: # of Rx | SRx hybrid: # of buffers | SRx hybrid: $P_T$ (mW) | SRx hybrid: % saving | MRx hybrid: # of Rx | MRx hybrid: # of buffers | MRx hybrid: $P_T$ (mW) | MRx hybrid: % saving |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 16   | 0.96 | 44   | 1.7   | 4   | 32   | 1.5   | 10.5 | 4   | 32   | 1.46  | 14.1 |
| 1 | 64   | 1.92 | 110  | 6.5   | 16  | 64   | 5.6   | 13.8 | 16  | 64   | 5.44  | 16.3 |
| 1 | 256  | 3.84 | 374  | 26.0  | 64  | 256  | 21.8  | 16.2 | 64  | 256  | 21.2  | 18.6 |
| 1 | 1024 | 7.69 | 1398 | 145.5 | 256 | 1024 | 118.4 | 18.6 | 256 | 1024 | 115.8 | 20.4 |
| 2 | 16   | 0.96 | 44   | 3.5   | 4   | 32   | 2.7   | 21.7 | 4   | 32   | 2.64  | 24.6 |
| 2 | 64   | 1.92 | 110  | 12.9  | 16  | 64   | 9.6   | 25.6 | 16  | 64   | 9.4   | 27.4 |
| 2 | 256  | 3.84 | 374  | 46.6  | 64  | 256  | 31.3  | 32.9 | 64  | 256  | 30.3  | 34.9 |
| 2 | 1024 | 7.69 | 1398 | 265.9 | 256 | 1024 | 171.0 | 35.7 | 256 | 1024 | 167.1 | 37.1 |

TABLE I: The proposed hybrid clocking scheme consumes up to 20% and 37% lower power compared to the buffered VM clocking at 1GHz and 2GHz clock frequency, respectively, using 16 to 1024 sinks networks.

In scaled technology, temperature variation can significantly degrade the performance of a CDN. In order to quantify the temperature-variation-induced clock skew, we used a 2-level 16-sinks H-tree network. Figure 9(a) shows the proposed hybrid CDN testbench, where the left branch of the H-tree network operates at $25^{\circ}C$ and the right branch operates at $125^{\circ}C$. For the VM system, we used the same network but with the CM Rxs and Tx replaced with buffers, as shown in Figure 9(b). According to our analysis, the VM system has 5.6% more skew compared to the hybrid system. The primary reason is the absence of buffers in the global CM network.

Fig. 9: The proposed hybrid system exhibits 5.6% lower temperature-variation-induced skew compared to the VM system: (a) a 2-level 16-sinks H-tree testbench for hybrid CDN and (b) the same 16-sinks H-tree testbench for buffered VM system to perform temperature variation analysis.

#### E. Effect of Supply Voltage Variation

One of the major sources of variation in modern microprocessor performance is supply voltage variation. In this analysis, we used the 16-sink H-tree network for both the hybrid and the VM systems, and we considered a $\pm 10\%$ variation of the supply voltage ($V_{dd}$) around its nominal value. The VM system has 21ps to 30ps skew due to the $\pm 10\%$ $V_{dd}$ variation. For the same voltage fluctuation range, the proposed hybrid system exhibits 21ps to 22ps skew. The hybrid system performs slightly better due to the absence of buffers in the CM network.

#### F. Jitter Analysis

In general, clock jitter can refer to timing jitter, period jitter, or cycle-to-cycle jitter. Timing jitter can be defined as the time difference between the actual and the ideal signal transition. Period jitter is the deviation of the clock period from its average value. Cycle-to-cycle jitter is the variation in period between two consecutive clock cycles. However, all of these jitters are mathematically related and can be expressed as cycle-to-cycle jitter [23], so in our experiments we consider only cycle-to-cycle jitter. We compute jitter by introducing a fixed timing variation at the root of each tree network. Using a 16-sink network, the proposed hybrid clocking has 33% lower jitter-induced delay variation compared to the buffered VM scheme. Better yet, when using a 1024-sink network, the proposed scheme has 54% lower delay variation compared to the VM scheme, due to the absence of buffers in the global CM CDN of the hybrid clocking scheme.
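To make the three jitter notions concrete, the sketch below computes them from a list of clock-edge timestamps. It assumes the common definitions (timing jitter as edge deviation from an ideal grid anchored at the first edge, period jitter as deviation from the mean period, cycle-to-cycle jitter as the difference between consecutive periods). The function names and the edge times are hypothetical and are not taken from our testbenches.

```python
def periods(edges):
    """Clock periods between consecutive rising edges."""
    return [b - a for a, b in zip(edges, edges[1:])]


def timing_jitter(edges, ideal_period):
    """Deviation of each edge from an ideal grid anchored at the first edge."""
    t0 = edges[0]
    return [t - (t0 + i * ideal_period) for i, t in enumerate(edges)]


def period_jitter(edges):
    """Deviation of each period from the average period."""
    p = periods(edges)
    average = sum(p) / len(p)
    return [x - average for x in p]


def cycle_to_cycle_jitter(edges):
    """Difference between each pair of consecutive periods."""
    p = periods(edges)
    return [b - a for a, b in zip(p, p[1:])]


# Hypothetical rising-edge times in ps for a nominal 1GHz clock (1000ps period).
edges_ps = [0, 1004, 1998, 3001, 4000]

tj = timing_jitter(edges_ps, 1000)
pj = period_jitter(edges_ps)
c2c = cycle_to_cycle_jitter(edges_ps)
print("timing jitter (ps):", tj)
print("period jitter (ps):", pj)
print("cycle-to-cycle jitter (ps):", c2c)
```

Because each quantity is a difference of the same edge times, any one of them can be derived from the others, which is why a single metric suffices in the experiments.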

#### G. Study on ISPD Clock Networks and ISCAS89 Circuit

In order to validate the proposed hybrid clocking scheme in a non-symmetric network, we use an ISPD 2009 testbench circuit (s1r1 with 81 sinks) [24]. The network is extracted from an IBM ASIC design and is distributed over $69.4\,mm^2$. The primary goal of this experiment is to determine the optimal cluster size and Rx placement. For this, we use the k-means clustering algorithm-based hybrid CDN generation methodology explained in Section III-C. Figure 10 shows the resulting HCDN (EHT-based local buffered and bufferless CM CDN) for the ISPD 2009 benchmark circuit s1r1.
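The clustering step can be illustrated with a minimal k-means sketch. This is not the methodology of Section III-C itself: it is plain Lloyd's iteration, the sink coordinates are randomly generated stand-ins (the real s1r1 placement is not reproduced here), and placing each Rx at its cluster centroid is an assumption made only for illustration. The cluster count of 16 is the value Table II lists for s1r1 at 5 sinks per cluster.

```python
import random


def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda c: (point[0] - centers[c][0]) ** 2 + (point[1] - centers[c][1]) ** 2,
    )


def centroid(points):
    """Mean position of a non-empty list of 2-D points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's k-means; returns (centers, cluster index per point)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialise from k distinct sinks
    for _ in range(iterations):
        assignment = [nearest(p, centers) for p in points]
        new_centers = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            # Keep the old center if a cluster happens to become empty.
            new_centers.append(centroid(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, [nearest(p, centers) for p in points]


# Hypothetical stand-in for s1r1: 81 sinks scattered over a 69.4 mm^2 square.
side_mm = 69.4 ** 0.5
rng = random.Random(1)
sinks = [(rng.uniform(0, side_mm), rng.uniform(0, side_mm)) for _ in range(81)]

rx_sites, cluster_of_sink = kmeans(sinks, k=16)
sizes = [cluster_of_sink.count(c) for c in range(16)]
print("sinks per cluster:", sizes)
```

Note that k-means fixes the number of clusters, not the number of sinks in each one, so "5 sinks per cluster" is an average; individual clusters can be larger or smaller.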

For a fair comparison, we use a synthesized industry-standard buffered VM network routed with minimum wire length [25]–[27]. The buffers are inserted to minimize skew subject to the slew constraint. Both schemes use 10% of the clock period as the slew constraint on the final clock output. In addition, we implement a deferred merge-embedding (DME)-based CM clocking scheme (global and local CM tree) [20].

We performed a wide range of simulations on the ISPD 2009 clock networks to identify the optimal number of receivers, which equals the number of clusters. The findings, shown in Table II, set a critical implementation direction for the hybrid clocking scheme. Using 4 sinks per cluster and 20 SRx circuits, the proposed hybrid scheme consumes 43.1% lower power compared to the buffered VM scheme. However, it nearly doubles the clock skew (41ps)

TABLE II: Clearly, 5 sinks/cluster enables maximum power saving; using this clustering method on the ISPD 2009 s1r1, ISPD 2010 01.in, and ISCAS89 s5378 circuits, the proposed MRx-based hybrid clocking saves up to 50% and 59% power compared to the synthesized buffered VM clocking scheme at 1GHz and 2GHz clock frequency, respectively.

| Frequency (GHz) | Benchmark       | CM system [20] |            | VM system |            | Proposed Hybrid system |         |           |                |                 |                |                 |
|-----------------|-----------------|----------------|------------|-----------|------------|------------------------|---------|-----------|----------------|-----------------|----------------|-----------------|
|                 |                 | Skew (ps)      | $P_T$ (mW) | Skew (ps) | $P_T$ (mW) | # of sinks/cluster     | # of Rx | Skew (ps) | SRx $P_T$ (mW) | SRx % of saving | MRx $P_T$ (mW) | MRx % of saving |
| 1               | ISPD 2009 s1r1  | 21             | 6.0        | 14        | 34.8       | 5                      | 16      | 21        | 20.2           | 41.9            | 20.0           | 42.4            |
|                 |                 |                |            |           |            | 10                     | 8       | 10        | 32.6           | 6.3             | 32.5           | 6.6             |
|                 |                 |                |            |           |            | 20                     | 4       | 6         | 42.6           | -18.3           | 42.6           | -18.2           |
|                 | ISPD 2010 01.in | 43             | 60.5       | 32        | 107.7      | 5                      | 220     | 61        | 68.3           | 36.6            | 66.1           | 38.6            |
|                 |                 |                |            |           |            | 10                     | 110     | 55        | 102.5          | 4.8             | 101.4          | 5.8             |
|                 |                 |                |            |           |            | 20                     | 55      | 31        | 150.8          | -28.6           | 150.3          | -28.3           |
|                 | ISCAS89 s5378   | 28             | 8.8        | 18        | 9.3        | 5                      | 35      | 12        | 5.0            | 46.0            | 4.7            | 49.8            |
|                 |                 |                |            |           |            | 10                     | 17      | 10        | 5.3            | 42.5            | 5.2            | 44.5            |
|                 |                 |                |            |           |            | 20                     | 9       | 5         | 42.8           | -78.3           | 42.7           | -78.0           |
| 2               | ISPD 2009 s1r1  | 16             | 7.8        | 13        | 69.7       | 5                      | 16      | 11        | 38.2           | 45.2            | 38.0           | 45.5            |
|                 |                 |                |            |           |            | 10                     | 8       | 7         | 64.0           | 8.2             | 63.9           | 9.0             |
|                 |                 |                |            |           |            | 20                     | 4       | 3         | 103.4          | -32.6           | 103.3          | -32.6           |
|                 | ISPD 2010 01.in | 41             | 83.9       | 87        | 251.7      | 5                      | 220     | 48        | 145.0          | 42.4            | 141.7          | 43.7            |
|                 |                 |                |            |           |            | 10                     | 110     | 39        | 215.3          | 14.5            | 213.7          | 15.1            |
|                 |                 |                |            |           |            | 20                     | 55      | 38        | 336.4          | -25.2           | 335.6          | -25.0           |
|                 | ISCAS89 s5378   | 21             | 11.5       | 18        | 18.3       | 5                      | 35      | 14        | 8.1            | 55.6            | 7.6            | 58.6            |
|                 |                 |                |            |           |            | 10                     | 17      | 11        | 8.8            | 52.2            | 8.5            | 53.3            |
|                 |                 |                |            |           |            | 20                     | 9       | 7         | 59.1           | -69.1           | 59.0           | -69.0           |



Fig. 10: The resulting HCDN for the ISPD 2009 benchmark circuit s1r1.





Fig. 11: Using 4 sinks per cluster saves slightly more power; however, it nearly doubles the clock skew compared to the 5 sinks per cluster scheme.

expenditure of extra power. The results are similar using an MRx circuit: the MRx-based design consumes 42.4% lower power compared to the buffered VM scheme. At 1GHz clock frequency, the standalone CM clocking consumes much lower power than the proposed hybrid schemes with similar skew. However, its performance depends on expensive additional HSPICE simulations, which we discuss in detail in Section IV-J.

As expected, at 2GHz clock frequency, the proposed SRx-based scheme consumes 8% and 45% lower power compared to the VM scheme using 10 sinks per cluster and 5 sinks per cluster, respectively. The primary source of the extra power saving at higher frequencies is that CM clocking power grows much more slowly with frequency than VM clocking power [11]. In addition, at 2GHz clock frequency using 5 sinks per cluster, the hybrid scheme has a 2ps skew improvement compared to the VM scheme, as shown in Table II. Using an MRx circuit, the proposed hybrid scheme consumes up to 42% and 46% lower power at 1GHz and 2GHz clock frequency, respectively. Using 5 sinks per cluster, the proposed MRx-based hybrid clocking has 31% lower skew compared to the existing CM scheme [20]. Clearly, 5 sinks per cluster is optimal and enables the best power efficiency for hybrid clocking.

We also apply the proposed hybrid clocking scheme to an ISPD 2010 network (01.in) [29]. The ISPD 2010 testbenches have a higher sink density than the ISPD 2009 testbenches. The 01.in clock network consists of 1107 sinks distributed over a $64\,mm^2$ area. The results of this experiment are shown in Table II and are consistent with our previous experiments.

Using 5 sinks per cluster and 220 SRx circuits, the proposed hybrid scheme consumes 68.3mW total power (CM network power + VM buffered network power), which is 36% lower than the traditional VM scheme. In addition, the proposed SRx-based scheme saves 5% average power compared to the VM scheme using 10 sinks per cluster. However, using 20 sinks per cluster, the proposed scheme consumes 29% more power compared to the VM scheme. At 1GHz clock frequency, using 5 sinks per cluster and 220 MRx circuits, the proposed MRx-based hybrid scheme consumes 39% lower power than the buffered VM scheme. At 1GHz clock frequency, the CM clocking consumes 8.5% lower power than the proposed hybrid scheme; however, it requires 1107 CM Rxs/FFs compared to our 220 Rx circuits.
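The savings quoted above follow directly from the total-power columns of Table II. The sketch below recomputes them for the 01.in network at 1GHz; the helper name `power_saving_percent` is illustrative, and the saving is taken relative to the first argument (the reference scheme).

```python
def power_saving_percent(p_reference_mw, p_proposed_mw):
    """Power saved by the proposed scheme, as a percentage of the reference."""
    return 100.0 * (p_reference_mw - p_proposed_mw) / p_reference_mw


# Total power values (mW) for ISPD 2010 01.in at 1GHz, copied from Table II.
P_VM_MW = 107.7       # buffered VM scheme
P_CM_MW = 60.5        # standalone CM scheme [20]
P_SRX_5_MW = 68.3     # SRx hybrid, 5 sinks per cluster
P_MRX_5_MW = 66.1     # MRx hybrid, 5 sinks per cluster
P_SRX_10_MW = 102.5   # SRx hybrid, 10 sinks per cluster

srx_5 = power_saving_percent(P_VM_MW, P_SRX_5_MW)
mrx_5 = power_saving_percent(P_VM_MW, P_MRX_5_MW)
srx_10 = power_saving_percent(P_VM_MW, P_SRX_10_MW)
cm_vs_mrx = power_saving_percent(P_MRX_5_MW, P_CM_MW)

print(f"SRx, 5 sinks/cluster vs VM:  {srx_5:.1f}%")
print(f"MRx, 5 sinks/cluster vs VM:  {mrx_5:.1f}%")
print(f"SRx, 10 sinks/cluster vs VM: {srx_10:.1f}%")
print(f"CM vs MRx hybrid:            {cm_vs_mrx:.1f}%")
```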

Using the denser testbench at 2GHz clock frequency, the proposed SRx-based scheme consumes 42% and 15% lower power compared to the VM scheme using 5 sinks per cluster and 10 sinks per cluster, respectively. However, the SRx-based hybrid scheme consumes 34% more power compared to the VM scheme using 20 sinks per cluster and 55 SRx circuits. On the other hand, at 2GHz clock frequency, the MRx-based scheme consumes 44% and 15% lower power compared to the VM scheme using 5 sinks per cluster and 10 sinks per cluster, respectively. Better yet, at 2GHz clock frequency using 5 sinks per cluster, the hybrid scheme has a 39ps skew improvement compared to the VM scheme, as shown in Table II.

In order to further validate the proposed hybrid clocking scheme, we use an ISCAS89 testbench circuit (s5378) [28]. The s5378 circuit has 179 D-type flip-flops (sinks) and 2779 gates. The primary goal of this experiment is to determine the optimal cluster size and compare the results with those of the ISPD testbenches.

Table II compares the total power consumption of the proposed hybrid scheme and the buffered VM scheme using the s5378 circuit. Clearly, the results are consistent with our previous experiments. According to our analysis, the proposed MRx-based hybrid scheme saves up to 50% and 59% average power compared to the synthesized VM scheme at 1GHz and 2GHz clock frequency, respectively. Considering the above discussion and results, it is apparent that the hybrid clocking methodology is more energy efficient with 5 sinks per cluster than with the other clustering strategies.

The proposed MRx-based hybrid scheme saves up to 47% and 34% average power compared to the standalone CM scheme at 1GHz and 2GHz clock frequency, respectively. The primary reason is the sink density: the total power consumption of the existing CM scheme is dominated by the power consumption of its Rx circuits [20]. The sink densities of s1r1, 01.in, and s5378 are 1.2 sinks/$mm^2$, 17.3 sinks/$mm^2$, and 75.5 thousand sinks/$mm^2$, respectively. In addition, at 1GHz clock frequency using 5 sinks per cluster, the MRx-based hybrid scheme has a 16ps skew improvement compared to the CM scheme, as shown in Table II.
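The sink densities of the two ISPD networks can be reproduced from the sink counts and areas stated earlier in this section. The sketch below does so; the dictionary layout is illustrative, and s5378 is omitted because its area is not restated here.

```python
# (number of sinks, area in mm^2) as stated in the text for each ISPD network.
networks = {
    "s1r1": (81, 69.4),
    "01.in": (1107, 64.0),
}

# Sink density in sinks per mm^2.
density = {name: sinks / area_mm2 for name, (sinks, area_mm2) in networks.items()}

for name, value in density.items():
    print(f"{name}: {value:.1f} sinks/mm^2")
```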

The power breakdown of the proposed hybrid clocking scheme at 1–2GHz clock frequency is shown in Figure 12. Clearly, the power consumption of global CM clocking increases as the number of sinks per cluster decreases. On the ISPD 2009 s1r1 network, the CM clocking using 5 sinks per cluster consumes 62% and 51% more power compared to the 20 sinks per cluster methodology at 1GHz and 2GHz clock frequency, respectively. The results are similar on the ISPD 2010 01.in network, as shown in Figure 12(a). The increase in power consumption with fewer sinks per cluster is due to the larger number of CM Rx circuits.

We observed the opposite trend for the VM network of the hybrid clocking scheme, as shown in Figure 12(b). The power consumption of local VM clocking decreases as the number of sinks per cluster decreases. On the ISPD 2010 01.in network, the VM clocking using 5 sinks per cluster consumes 64% and 62% lower power compared to the 20 sinks per cluster methodology at 1GHz and 2GHz clock frequency, respectively. The results are similar on the ISPD 2009 s1r1 and ISCAS89 s5378 networks. The reduction in power consumption with fewer sinks per cluster is due to the smaller local VM CDN.

Fig. 12: The hybrid clocking power breakdown: (a) the power consumption of global CM clocking increases proportionally with the decrease of the number of sinks per cluster, and (b) the power consumption of local VM clocking decreases proportionally with the decrease of the number of sinks per cluster.

Fig. 13: The proposed hybrid clocking consumes 17.6% to 24.3% lower area compared to the existing CM clocking scheme using ISPD 2009 s1r1, ISPD 2010 01.in, and ISCAS89 s5378 networks.

#### H. Active Area Comparison

In the proposed hybrid clocking, we use global bufferless CM clocking and local buffered VM clocking. Figure 13 compares the total active silicon area of the proposed clocking with that of the standalone CM clocking [20] and the conventional buffered VM clocking. The proposed HCDN includes the global CM Tx area, the local buffer area, and the VM master-slave D FF area. The standalone CM clocking includes the CM Tx area and the final CM Rx/FF area. The buffered VM clocking includes the buffer area and the VM master-slave D FF area. The proposed hybrid clocking consumes 17.6% to 24.3% lower area compared to the existing CM clocking scheme using ISPD 2009, ISPD 2010, and ISCAS89 networks. The area overhead in CM clocking is primarily due to the large CM FF area (7.96 $\mu m^2$) [20], compared to our proposed CM MRx area (4.48 $\mu m^2$). However, the proposed clocking consumes 5% to 21% more area compared to the VM scheme. This is primarily due to the CM Rx circuits. On the other hand, the standalone CM clocking consumes 28% to 36% more area compared to the buffered VM scheme.

#### I. Clock Networks Skew Variation

In order to address the process variation, we ran Monte Carlo simulations of ISPD 2009, ISPD 2010, and ISCAS89 testbenches. In this analysis, we considered transistor threshold voltage variation using a Gaussian distribution function that uses 10% absolute variation with three standard deviations from the nominal value. The results of this analysis, mean clock skew ( $\mu$ ), standard deviation ( $\sigma$ ), and maximum skew ( $\mu$  + 3 $\sigma$ ), are shown in Table III considering 1000 iterations.



Fig. 14: The hybrid clocking scheme has up to  $10 \times$  and  $2.7 \times$  less runtime compared to the traditional buffered VM and bufferless CM algorithms, respectively.

Table III also lists the deterministic skew ($d$) of the CM, VM, and proposed MRx-based hybrid clocking schemes. At 1GHz clock frequency, the proposed hybrid clocking has 44.3ps maximum skew and the buffered VM scheme has 46.4ps maximum skew on the s1r1 circuit. In addition, the existing CM clocking has more variability ($\frac{3\sigma}{\mu} = 0.25$) compared to the proposed hybrid scheme's variability of 0.18. This is primarily due to the additional transistors in the CM Rx/FF circuits.
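The maximum skew ($\mu + 3\sigma$) and the variability ($\frac{3\sigma}{\mu}$) can be recomputed from the mean and standard deviation columns of Table III. The sketch below does this for every benchmark and scheme; the data layout and function names are illustrative, and small differences from the tabulated maxima are expected because the table entries are rounded.

```python
# (mean skew in ps, standard deviation in ps) copied from Table III.
table_iii = {
    "s1r1": {"CM": (39.2, 3.24), "VM": (28.8, 5.90), "Hybrid": (37.6, 2.25)},
    "01.in": {"CM": (64.5, 6.1), "VM": (46.5, 6.29), "Hybrid": (61.5, 3.40)},
    "s5378": {"CM": (33.4, 2.47), "VM": (21.9, 2.28), "Hybrid": (15.1, 1.83)},
}


def max_skew(mu, sigma):
    """Three-sigma upper bound on the skew."""
    return mu + 3 * sigma


def variability(mu, sigma):
    """Three-sigma spread relative to the mean skew."""
    return 3 * sigma / mu


for benchmark, schemes in table_iii.items():
    for scheme, (mu, sigma) in schemes.items():
        print(
            f"{benchmark:6s} {scheme:6s} "
            f"max skew = {max_skew(mu, sigma):5.1f} ps, "
            f"variability = {variability(mu, sigma):.2f}"
        )
```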

#### J. Runtime

The proposed algorithm has up to  $10 \times$  less runtime compared to the traditional buffered VM algorithm using ISPD 2009, ISPD 2010, and ISCAS89 testbenches, as shown in Figure 14. The primary reason is that the runtime is dominated by the HSPICE simulation, which is also the *de facto* standard methodology for high-performance microprocessor design. In addition, the proposed HCDN has up to  $2.7 \times$  less runtime compared to the existing CM algorithm using ISPD 2009, ISPD 2010, and ISCAS89 testbenches, as shown in Figure 14. The primary reason is that in addition to the Tx sizing algorithm, the existing CM scheme requires the expensive CM FF sizing algorithm to balance skew [20].

Since the CM and VM CDN generation can work in tandem (see Figure 6), the proposed methodology does not have any timing overhead.

In summary, the proposed HCDN can save a significant amount of clock power and exhibits better robustness (considering temperature variation,  $V_{dd}$  variation, jitter-induced skew, clock skew) compared to the VM buffered scheme, as shown in Table IV.

TABLE III: Applying a $\sim 5$ sinks/cluster strategy in 1000 Monte Carlo iterations on the s1r1 circuit, the proposed MRx-based hybrid clocking scheme has more deterministic and average clock skew compared to the synthesized buffer scheme; however, the maximum skew is well within the conventional skew bound (100ps @ 1GHz).

| Benchmark | CM system [20] |            |               |                         | Buffered VM system |            |               |                         | Hybrid system |            |               |                         |
|-----------|----------------|------------|---------------|-------------------------|--------------------|------------|---------------|-------------------------|---------------|------------|---------------|-------------------------|
|           | $d$ (ps)       | $\mu$ (ps) | $\sigma$ (ps) | $(\mu + 3\sigma)$ (ps)  | $d$ (ps)           | $\mu$ (ps) | $\sigma$ (ps) | $(\mu + 3\sigma)$ (ps)  | $d$ (ps)      | $\mu$ (ps) | $\sigma$ (ps) | $(\mu + 3\sigma)$ (ps)  |
| s1r1      | 21             | 39.2       | 3.24          | 48.9                    | 14                 | 28.8       | 5.90          | 46.4                    | 21            | 37.6       | 2.25          | 44.3                    |
| 01.in     | 43             | 64.5       | 6.1           | 82.7                    | 32                 | 46.5       | 6.29          | 65.4                    | 61            | 61.5       | 3.40          | 71.7                    |
| s5378     | 28             | 33.4       | 2.47          | 40.8                    | 18                 | 21.9       | 2.28          | 28.7                    | 12            | 15.1       | 1.83          | 20.6                    |

TABLE IV: In summary, the proposed HCDN can save a significant amount of clock power and exhibits better robustness (considering temperature variation, $V_{dd}$ variation, jitter-induced skew, clock skew) compared to the VM buffered scheme.

| Benchmark           | Analysis                    | VM    | HCDN  | % improvement |
|---------------------|-----------------------------|-------|-------|---------------|
| H-tree (16 sinks)   | Power @ 2GHz (mW)           | 3.5   | 2.64  | 24.6          |
| H-tree (1024 sinks) | Power @ 2GHz (mW)           | 265.9 | 167.1 | 37.1          |
| 01.in (1107 sinks)  | Power @ 2GHz (mW)           | 251.7 | 141.7 | 43.7          |
| s5378 (179 sinks)   | Power @ 2GHz (mW)           | 18.3  | 7.6   | 58.6          |
| H-tree (16 sinks)   | Temp. var. skew (ps)        | 82.0  | 77.4  | 5.6           |
| H-tree (16 sinks)   | 10% $V_{dd}$ var. skew (ps) | 21-30 | 21-22 | 33.3          |
| H-tree (1024 sinks) | Jitter-induced skew (ps)    | 22.0  | 10.1  | 54.1          |
| s1r1 (81 sinks)     | Skew (ps) @ 2GHz            | 13.0  | 11.0  | 15.4          |
| 01.in (1107 sinks)  | Skew (ps) @ 2GHz            | 87.0  | 48.0  | 44.8          |
| 01.in (1107 sinks)  | Runtime (hr)                | 8.0   | 0.8   | 90            |

#### V. CONCLUSION

We have presented the first low-jitter hybrid clocking scheme by integrating global bufferless CM clocking and local buffered VM clocking. When applied to a symmetric H-tree network, the hybrid scheme saves 15% and 29% average power compared to a synthesized buffered VM scheme at 1GHz and 2GHz clock frequency, respectively. In addition, the hybrid scheme exhibits 33% and 54% lower jitter-induced delay variation compared to the VM scheme for 16-sink and 1024-sink networks, respectively. Furthermore, the proposed scheme exhibits 5.6% lower temperature-variation-induced skew compared to the VM system, using a 16-sink H-tree network.

The proposed HCDN methodology consumes lower power and offers improved robustness compared to the existing synthesized VM methodology. However, the power efficiency is strongly influenced by the number of sinks per cluster and the Rx placement, as shown in Tables II–IV. In this paper, we have also identified the optimal cluster size for hybrid clocking using ISPD 2009, ISPD 2010, and ISCAS89 testbench circuits. The optimal number of sinks per cluster is 5. Using this clustering technique, the proposed scheme consumes up to 59% lower power compared to the synthesized VM scheme, with better or comparable skew. Better yet, the hybrid methodology has up to $10 \times$ less runtime compared with the buffered VM networks.

#### REFERENCES

- [1] K. A. Jenkins, K. L. Shepard, and Z. Xu, "On-Chip Circuit for Measuring Period Jitter and Skew of Clock Distribution Networks," *CICC*, September 2007, pp. 157–160.
- [2] J. D. Schaub, F. H. Gebara, T. Y. Nguyen, I. Vo, J. Pena, and D. J. Acharyya, "On-chip jitter and oscilloscope circuits using an asynchronous sample clock," *ESSCIRC*, September 2008, pp. 126–129.
- [3] K. Niitsu, M. Sakurai, N. Harigai, T. J. Yamaguchi, and H. Kobayashi, "CMOS Circuits to Measure Timing Jitter Using a Self-Referenced Clock and a Cascaded Time Difference Amplifier With Duty-Cycle Compensation," *JSSC*, vol. 47, no. 11, pp. 2701–2710, November 2012.
- [4] M. Omana, D. Rossi, D. Giaffreda, C. Metra, T. M. Mak, A. Rahman, and S. Tam, "Low-Cost On-Chip Clock Jitter Measurement Scheme," *TVLSI*, vol. 23, no. 3, pp. 435–443, March 2015.
- [5] N. K. Kancharapu, M. Dave, V. Masimukkula, M. S. Baghini, and D. K. Sharma, "A Low-Power Low-Skew Current-Mode Clock Distribution Network in 90nm CMOS Technology," *ISVLSI*, July 2011, pp. 132–137.
- [6] M. Dave, M. Jain, S. Baghini, and D. Sharma, "A Variation Tolerant Current-Mode Signaling Scheme for On-Chip Interconnects," *TVLSI*, vol. 21, no. 2, pp. 342–353, January 2013.
- [7] M. R. Guthaus and R. Islam, "Current-mode clock distribution," US Patent, no. 9787293, October 2017.
- [8] K. Kadirvel, J. Carpenter, P. Huynh, J. M. Ross, R. Shoemaker, and B. Lum-Shue-Chan, "A Stackable, 6-Cell, Li-Ion, Battery Management IC for Electric Vehicles With 13, 12-bit ΣΔ ADCs, Cell Balancing, and Direct-Connect Current-Mode Communications," *JSSC*, vol. 49, no. 4, pp. 928–934, April 2014.
- [9] R. Islam, H. A. Fahmy, P. Y. Lin, and M. R. Guthaus, "DCMCS: Highly Robust Low-Power Differential Current-Mode Clocking and Synthesis," *TVLSI*, vol., no., pp. 1–10, May 2018.
- [10] A. Narasimhan, M. Kasotiya, and R. Sridhar, "A low-swing differential signalling scheme for on-chip global interconnects," *ICVD*, Jan 2005, pp. 634–639.
- [11] R. Islam and M. R. Guthaus, "Low-Power Clock Distribution Using a Current-Pulsed Clocked Flip-Flop," *TCASI*, vol. 62, no. 4, pp. 1156–1164, Apr 2015.
- [12] R. Islam and M. R. Guthaus, "Current-mode clock distribution," *ISCAS*, June 2014, pp. 1203–1206.
- [13] A. Maheshwari and W. Burleson, "Differential current-sensing for on-chip interconnects," *TVLSI*, vol. 12, no. 12, pp. 1321–1329, Dec 2004.
- [14] R. Islam, H. Fahmy, P.-Y. Lin, and M. R. Guthaus, "Differential current-mode clock distribution," *MWSCAS*, August 2015, pp. 1–4.
- [15] A. Maheshwari and W. Burleson, "Current-Sensing and Repeater Hybrid Circuit Technique for On-Chip Interconnects," *TVLSI*, vol. 15, no. 11, Nov 2007, pp. 1239–1244.
- [16] L. Ravezzi and H. Partovi, "Clock and Synchronization Networks for a 3 GHz 64 Bit ARMv8 8-Core SoC," *JSSC*, vol. 50, no. 7, July 2015, pp. 1702–1710.
- [17] S. C. Chan, K. L. Shepard, and P. J. Restle, "Uniform-Phase Uniform-Amplitude Resonant-Load Global Clock Distributions," *JSSC*, vol. 40, no. 1, January 2005, pp. 102–109.
- [18] M. R. Guthaus, G. Wilke, and R. Reis, "Revisiting automated physical synthesis of high-performance clock networks," *TODAES*, vol. 18, no. 2, April 2013, pp. 31:1–31:27.
- [19] X. W. Shih and Y. W. Chang, "Fast Timing-Model Independent Buffered Clock-Tree Synthesis," *TCAD*, vol. 31, no. 9, September 2012, pp. 1393–1404.
- [20] R. Islam and M. R. Guthaus, "CMCS: Current-Mode Clock Synthesis," *TVLSI*, vol. 25, no. 3, Mar 2017, pp. 1054–1062.
- [21] NCSU, "FreePDK45," http://www.eda.ncsu.edu/wiki/FreePDK45.
- [22] J. Mauricio, F. Moll, and S. Gómez, "Measurements of Process Variability in 40-nm Regular and Nonregular Layouts," *TED*, vol. 61, no. 2, February 2014, pp. 365–371.
- [23] T. J. Yamaguchi, M. Soma, D. Halter, R. Raina, J. Nissen, and M. Ishida, "A method for measuring the cycle-to-cycle period jitter of high-frequency clock signals," *VTS*, 2001, pp. 102–110.
- [24] C. N. Sze, P. Restle, G. J. Nam, and C. J. Alpert, "Clocking and the ISPD'09 clock synthesis contest," *ISPD*, Mar 2009, pp. 149–150.
- [25] R.-S. Tsay, "Exact zero skew," *ICCAD*, Nov 1991, pp. 336–339.
- [26] R.-S. Tsay, "An exact zero-skew clock routing algorithm," *TCAD*, vol. 12, no. 2, February 1993, pp. 242–249.
- [27] K. D. Boese and A. B. Kahng, "Zero-skew clock routing trees with minimum wirelength," *ASIC*, September 1992, pp. 17–21.
- [28] F. Brglez, D. Bryan, and K. Kozminski, "Combinational profiles of sequential benchmark circuits," *ISCAS*, May 1989, pp. 1929–1934.
- [29] C. N. Sze, "ISPD 2010 High Performance Clock Network Synthesis Contest," *ISPD*, Mar 2010.



**Riadul Islam** is currently an Assistant Professor in the Department of Electrical and Computer Engineering at the University of Michigan-Dearborn. In his Ph.D. dissertation work at UCSC, Dr. Islam designed the first current-pulsed flip-flop/register, which resulted in the first-ever one-to-many current-mode clock distribution networks for high-performance microprocessors. From 2007 to 2009, he worked as a full-time faculty member in the Department of Electrical and Electronic Engineering of the University of Asia Pacific, Dhaka, Bangladesh. He is a member of the IEEE and the IEEE Circuits and Systems (CAS) Society. He holds one US patent and has several IEEE/ACM journal and conference publications in TVLSI, TCAS, ISCAS, MWSCAS, ISQED, and ASICON. His current research interests include digital, analog, and mixed-signal CMOS ICs/SOCs for a variety of applications; verification and testing techniques for analog, digital, and mixed-signal ICs; CAD tools for design and analysis of microprocessors and FPGAs; automobile electronics; and biochips.



**Matthew R. Guthaus** is currently a full Professor at the University of California Santa Cruz in the Computer Engineering department. Matthew received his BSE in Computer Engineering in 1998, MSE in 2000, and PhD in 2006 in Electrical Engineering, all from The University of Michigan. Matthew is a Senior Member of ACM and IEEE and a member of IFIP Working Group 10.5. His research interests are in low-power computing including applications in mobile health systems. This includes new circuits, architectures, and sensors along with their application to mobile and clinical health systems. He is the creator of the OpenRAM memory compiler. Matthew is the recipient of a 2011 NSF CAREER award and a 2010 ACM SIGDA Distinguished Service Award.