# UCLA Electronic Theses and Dissertations

**Title** A 56-Gb/s 8-mW PAM4 CDR with High Jitter Tolerance

Permalink https://escholarship.org/uc/item/8j44169k

Author Hou, Guanrong

Publication Date 2021

Peer reviewed|Thesis/dissertation

### UNIVERSITY OF CALIFORNIA

Los Angeles

A 56-Gb/s 8-mW PAM4 CDR with High Jitter Tolerance

A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Electrical Engineering

by

Guanrong Hou

2021

© Copyright by Guanrong Hou 2021

### ABSTRACT OF THE DISSERTATION

A 56-Gb/s 8-mW PAM4 CDR with High Jitter Tolerance

by

Guanrong Hou

Doctor of Philosophy in Electrical Engineering

University of California, Los Angeles, 2021

Professor Behzad Razavi, Chair

Advances in smart devices, the development of cloud computing, and the surge in Internet usage are driving ever-faster growth in demand for wireline communications, which include very-short-range chip-to-chip communications, data center interconnections, cross-data-center interconnections, and metro and long-haul communications. Four-level pulse-amplitude modulation (PAM4) has emerged as the latest trend and as a replacement for traditional non-return-to-zero (NRZ) standards. Among current PAM4 standards, 56 Gb/s is one of the popular data rates for wireline data communication.

In PAM4 wireline communications, the clock and data recovery (CDR) circuit is one of the most important building blocks; without it, the receiver cannot recover the input data correctly. In most applications, the CDR determines the recovered clock jitter, loop bandwidth, and jitter tolerance. Regarding power consumption, a CDR takes a significant share of the power in line-side applications (20% in some cases), while in host-side applications it uses most of the power. Therefore, we would like a CDR that has low recovered clock jitter, high jitter tolerance, low power consumption, and a loop bandwidth suited to the specific standard.

A review of current work shows that most recent PAM4 CDRs are still slicer based or ADC/DSP (analog-to-digital converter/digital signal processing) based: an incoming PAM4 signal is first transformed into NRZ signals, which are then processed with traditional NRZ approaches. Slicers or ADC/DSP usually lead to high power consumption, heavy calibration, or both.

This work proposes a new 56-Gb/s PAM4 CDR architecture. It uses a proposed phase detector (PD) design that processes the PAM4 signal in the analog domain by generating Euclidean distances among the samples. This work also proposes an analog background offset-cancellation scheme that makes the PD robust.

Realized in 28 nm CMOS technology, the CDR prototype consumes a total power of 8 mW. It has a recovered clock jitter of 547 fs rms for a 160-MHz loop bandwidth and a jitter tolerance of at least 1 unit interval (UI) (BER $< 10^{-12}$) at 10 MHz.

The dissertation of Guanrong Hou is approved.

Danijela Cabric

Jason Cong

William Kaiser

Behzad Razavi, Committee Chair

University of California, Los Angeles

2021

To my parents

# TABLE OF CONTENTS

| Section | Title |
|---------|-------|
|         | List of Figures |
|         | List of Tables |
|         | Acknowledgments |
|         | Curriculum Vitae |
| 1       | Introduction |
| 1.1     | Motivation |
| 1.2     | Thesis Organization |
| 2       | Background |
| 2.1     | NRZ Clock and Data Recovery |
| 2.1.1   | NRZ CDR Performance Parameters |
| 2.1.2   | CDR Generic Architecture |
| 2.1.3   | Alexander PD |
| 2.1.4   | Hogge PD |
| 2.1.5   | Baud-Rate PD |
| 2.1.6   | Other CDR Components |
| 2.2     | PAM4 Clock and Data Recovery |
| 3       | Design of a 56-Gb/s 8-mW PAM4 CDR with High Jitter Tolerance |
| 3.1     | Overview of the Architecture |
| 3.2     | Phase Detector Basis |
| 3.2.1   | Euclidean Distance |
| 3.2.2   | Timing |
| 3.2.3   | The Circuit to Generate the Euclidean Distance |
| 3.2.4   | The Need for Two Euclidean Distances |
| 3.2.5   | Next Step |
| 3.3     | Offset Cancellation |
| 3.3.1   | Concepts |
| 3.3.2   | Proposed Comparator Circuit |
| 3.3.3   | Complete Offset-Cancellation Loop |
| 3.3.4   | Retime the Output of the Proposed Comparator |
| 3.4     | Data Extraction |
| 3.5     | Charge Pump |
| 3.6     | Overall Phase Detector and Charge Pump Unit |
| 3.7     | VCO and Clock Generation |
| 3.7.1   | VCO |
| 3.7.2   | Clock Generation |
| 4       | Experimental Results |
| 4.1     | Die Photograph |
| 4.2     | Experiment Setup |
| 4.3     | Measurement Results |
| 4.4     | Comparison Table |
| 5       | Conclusion |
|         | References |

# LIST OF FIGURES

| Figure | Caption |
|--------|---------|
| 1.1  | Simulated 30-inch-trace channel loss for 56-Gb/s PAM4 and NRZ |
| 1.2  | A generic architecture of a NRZ or PAM4 RX |
| 2.1  | Theoretical jitter transfer |
| 2.2  | Theoretical jitter tolerance |
| 2.3  | CEI-56G-MR-PAM4 receiver jitter tolerance mask |
| 2.4  | A behavioral CDR block diagram |
| 2.5  | Alexander PD |
| 2.6  | Alexander PD waveforms |
| 2.7  | Ideal Alexander PD transfer characteristic |
| 2.8  | Half-rate Alexander PD timing |
| 2.9  | Hogge PD |
| 2.10 | Hogge PD waveforms |
| 2.11 | Ideal Hogge PD transfer characteristic |
| 2.12 | Half-rate linear PD |
| 2.13 | Baud-rate PD exemplary waveforms |
| 2.14 | Muller-Muller PD |
| 2.15 | Analog CDR loop filters |
| 2.16 | (a) LC VCO, (b) Ring VCO |
| 2.17 | A 4-level PAM4 eye and the three levels of slicers needed |
| 2.18 | A PAM4 signal transformed into digital signals |
| 2.19 | (a) Ideal Hogge PD waveforms, (b) A more realistic representation |
| 2.20 | Sampling both data points and edge points with slicers |
| 2.21 | Combining results from three PD units |
| 2.22 | A simple method to combine results from three PD units |
| 2.23 | Three types of transitions in PAM4: (a) major transitions, (b) middle transitions, and (c) minor transitions |
| 2.24 | Majority Voting |
| 2.25 | Deciding early or late based on sampled edge voltage |
| 3.1  | Overview of the architecture |
| 3.2  | Alexander PD detects (a) clock late, and (b) clock early |
| 3.3  | Alexander PD detects (a) clock late, and (b) clock early with more realistic waveforms |
| 3.4  | Measuring Euclidean distances for PAM4 when (a) the clock is late and (b) the clock is early |
| 3.5  | $V_{\rm A} + V_{\rm B} - 2V_{\rm E}$ for all PAM4 transitions |
| 3.6  | Timing challenge for $V_{\rm A} + V_{\rm B} - 2V_{\rm E}$ |
| 3.7  | One-eighth rate for $V_{\rm A} + V_{\rm B} - 2V_{\rm E}$ |
| 3.8  | Quarter rate for $V_{\rm A} + V_{\rm B} - 2V_{\rm E}$ as a comparison |
| 3.9  | The circuit for $V_{\rm A} + V_{\rm B} - 2V_{\rm E}$ |
| 3.10 | Simulated waveforms for the $V_{\rm A} + V_{\rm B} - 2V_{\rm E}$ circuit |
| 3.11 | Low-to-high transitions and high-to-low transitions in PAM4 |
| 3.12 | The circuit for $V_{\rm A} - V_{\rm B}$ |
| 3.13 | Simulated waveforms for the $V_{\rm A} - V_{\rm B}$ circuit |
| 3.14 | Multiplication of $V_{\rm A} + V_{\rm B} - 2V_{\rm E}$ and $V_{\rm A} - V_{\rm B}$ when the clock is late |
| 3.15 | Conventional offset cancellation approach |
| 3.16 | Concepts of the proposed offset-cancellation comparator |
| 3.17 | Phase 1 of the proposed offset-cancellation scheme |
| 3.18 | Phase 2 of the proposed offset-cancellation scheme |
| 3.19 | Phase 3 of the proposed offset-cancellation scheme |
| 3.20 | Proposed comparator |
| 3.21 | Two configurations: (a) differential configuration, and (b) regenerative configuration |
| 3.22 | Timing of $S_1$, $S_2$, $S_3$, $S_4$, and $S_T$ |
| 3.23 | Timing of $S_5$, $S_6$, $S_7$, $S_8$, $S_9$, and $S_{10}$ |
| 3.24 | Simulated waveforms of the proposed comparator |
| 3.25 | Simulated waveforms of $V_{os}$ in the presence of offsets |
| 3.26 | Background offset-cancellation loops: (a) $V_{\rm A} + V_{\rm B} - 2V_{\rm E}$, and (b) $V_{\rm A} - V_{\rm B}$ |
| 3.27 | A mathematical model for the background offset-cancellation loop |
| 3.28 | The effect of the residual input-referred DC offset, $\delta$ |
| 3.29 | The total input-referred DC offset reduces to $\delta$ over time in the closed loop |
| 3.30 | Differential and single-ended waveforms of $V_{\rm out}$ of the proposed comparator |
| 3.31 | Timing of the retimer |
| 3.32 | One data extraction unit |
| 3.33 | Charge pump |
| 3.34 | An overview of the phase detector and its associated charge pump unit |
| 3.35 | Simulated average output current versus phase error |
| 3.36 | PD characteristics with and without offset-cancellation |
| 3.37 | PD characteristics with slicer offsets |
| 3.38 | (*continued*) PD characteristics with slicer offsets |
| 3.39 | Transitions detected in case "D" |
| 3.40 | Transitions detected in case "E" |
| 3.41 | Transitions detected in case "F" |
| 3.42 | VCO and inductor design |
| 3.43 | Clock generation |
| 3.44 | Clock divider latch |
| 4.1  | Die photograph |
| 4.2  | Experiment setup |
| 4.3  | Jitter transfer |
| 4.4  | Jitter tolerance |
| 4.5  | Recovered clock spectrum |
| 4.6  | Phase noise from 100 Hz to 100 MHz |

# LIST OF TABLES

| Table | Caption |
|-------|---------|
| 1.1 | State-of-the-art PAM4 CDR |
| 4.1 | Performance summary and comparison table |

# ACKNOWLEDGMENTS

First of all, I want to express great gratitude to my advisor, Professor Behzad Razavi. He gave me an opportunity and guided me patiently over the years. I still remember the three words he told me: 'intelligence, passion, and creativity'. He is the gold standard in terms of not just research and design, but also teaching. Beyond our meetings, his textbooks were a great help as well.

I am grateful to my committee members, Professor Danijela Cabric, Professor Jason Cong, and Professor William Kaiser, for their time, advice, and suggestions. I want to thank Professor Mario Gerla as well.

I want to thank all the professors and teaching assistants whose lectures and discussions I attended. I learned everything from circuit fundamentals to topics beyond circuits. I also want to thank all the professors I worked with as a teaching assistant, from whom I learned everything from technical knowledge to management skills.

I want to thank all members in our research group I have worked with as well. Long and Abishek told me lots of things I should know. Yikun helped me with design verification and clock generation. Mehrdad helped me with 28 nm PDK. Hossein, Onur, Yu and Matias helped with various topics, such as using HFSS, using probe station, and so on. I also owe a tremendous amount of thanks to Atharav for his help and advice over the years. He helped me with verifying my design and offered invaluable advice. He also showed me pad frame design, BERT knowledge, serial bus programming, PCB design, lab testing, and so on.

My internship experience at Tensorcom also helped me a lot and I want to thank Zaw Soe, Steve Gao, Kevin Jing, James Yu, and Fangzhuo Dong. I have had a great time there.

I want to thank the PKU/UCLA Integrated BS+MS "3+2" Program and Professor Jason Cong for presiding over it. This program helped me to come to UCLA and realize that my interest is in circuits.

I also want to thank the UCLA ECE department. Deeona and her office helped me with all sorts of paperwork and procedural issues. Minji from the Center for High Frequency Electronics helped me with everything from dicing and wire bonding to various lab work details.

Most importantly, I want to thank my mom Yuzhu Liu and my dad Wenguo Hou. I only hope I have honored them.

# CURRICULUM VITAE

B.S. in Electrical Engineering, Peking University, Beijing, China.

M.S. in Electrical Engineering, UCLA, CA, USA.

Preliminary Examination Fellowship, Electrical Engineering Department, UCLA, CA, USA.

RFIC Engineer Intern, Tensorcom Inc., San Diego, CA, USA.

Teaching Associate, Electrical Engineering Department, UCLA, CA, USA.

Teaching Fellow, Electrical Engineering Department, UCLA, CA, USA.

# CHAPTER 1

# Introduction

### 1.1 Motivation

Online streaming, video conferencing, and other Internet usage have become more and more prevalent in individuals' everyday lives as well as in organizations' regular operations. According to [1], Internet devices will have a compound annual growth rate of 10% from 2018 to 2023, Ethernet speed will at least double from 2018 to 2023, and cellular or Wi-Fi speed will at least triple from 2018 to 2023. With the help of high-speed, easily accessible Internet, cloud computing has become more and more popular as well. While research organizations, corporations, and government agencies continue to make greater use of cloud computing, individual users are starting to use it as well, through services such as the Stadia cloud gaming service and the NFL's Next Gen Stats by AWS.

These trends mean that more and more data will be transmitted among chips in personal devices, among machine racks, among data centers, among cities, and across the globe via optical wireline links. Non-return-to-zero (NRZ) signaling has always been a popular choice for wireline links. To accommodate higher data throughput, NRZ standards must keep increasing their data rate, and hence their Nyquist frequency. However, as the Nyquist frequency increases, so does channel loss. When channel loss is large enough, data communication with NRZ standards becomes almost impossible or very costly.

4-level pulse-amplitude modulation (PAM4) transmits two bits per symbol while NRZ transmits one bit per symbol. For the same data rate, PAM4's Nyquist frequency is only half that of NRZ, which means a significant reduction in channel loss. This makes PAM4 a better option than NRZ for high-speed wireline data transmission. Using the FR4 microstrip line model proposed in [2], Fig. 1.1 shows the simulated channel loss of a 30-inch trace. It shows that by changing from NRZ to PAM4, the channel loss decreases by more than 10 dB. Such trends show the potential of PAM4 over NRZ.

Figure 1.1: Simulated 30-inch-trace channel loss for 56-Gb/s PAM4 and NRZ.
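The halving of the Nyquist frequency can be checked with a few lines of arithmetic. The sketch below assumes ideal signaling with no coding overhead; the function name `nyquist_ghz` is illustrative and does not come from the original text.

```python
def nyquist_ghz(data_rate_gbps: float, bits_per_symbol: int) -> float:
    """Nyquist frequency in GHz, taken as half the symbol (baud) rate."""
    baud_rate = data_rate_gbps / bits_per_symbol  # symbols per second, in GBaud
    return baud_rate / 2


nrz = nyquist_ghz(56, 1)   # NRZ carries 1 bit per symbol
pam4 = nyquist_ghz(56, 2)  # PAM4 carries 2 bits per symbol
print(f"56 Gb/s NRZ:  {nrz} GHz")
print(f"56 Gb/s PAM4: {pam4} GHz")
```

At 56 Gb/s this gives 28 GHz for NRZ and 14 GHz for PAM4, which is the frequency pair at which the channel loss comparison is made.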

Investigations in [3] have shown that PAM4 is a promising solution. [4] has stated that "50G PAM4 will stand out with its price-to-performance ratio and have full market potentials" (p. 20). [5] has stated that the global optical transceiver market will be almost \$2B by 2024 and that "sales of PAM4 DSP chipsets for applications in Ethernet transceivers and AOCs will account for half of this market segment" (para. 1).

For either NRZ or PAM4 applications, clock and data recovery (CDR) is an important and necessary module, as shown in Fig. 1.2. The CDR extracts the clock from the incoming data and retimes the data to remove jitter accumulated during transmission. It must also satisfy strict requirements set by the relevant standards [6]. Furthermore, a high-performance CDR can relax other aspects of the transceiver. For example, a CDR with a wide loop bandwidth can relax the design of the oscillator and clock generation circuits, and a CDR with high jitter tolerance can relax the requirement on the maximum jitter amplitude of the incoming data.

Figure 1.2: A generic architecture of a NRZ or PAM4 RX.

Besides high performance, we also want the CDR to be low power. According to [7], the US is estimated to have an installed base of 18 million servers in 2020. [7] also states that "in 2014, data centers in the U.S. consumed an estimated 70 billion kWh, representing about 1.8% of total U.S. electricity consumption" and "based on current trend estimates, U.S. data centers are projected to consume approximately 73 billion kWh in 2020" (p. ES-1). High power consumption also means that a large amount of heat is generated within data centers, so water is needed for cooling; 660 billion liters of water will be used in 2020 for this purpose [7]. A lower-power CDR means a reduction in data center electricity and water consumption. Lowering power consumption for personal devices is also significant: it helps reduce energy bills or extend battery life. In some short-range applications, power consumption is the bottleneck as well.

Table 1.1: State-of-the-art PAM4 CDR

|                              | Roshan-Zamir<br>JSSC 2019 | Aurangozeb<br>JSSC 2018 | Zhang<br>JSSC 2020 | Zhao<br>CICC 2020 | DH. Kwon<br>TCS-II 2019 |
|------------------------------|---------------------------|-------------------------|--------------------|-------------------|-------------------------|
| Data Rate (Gb/s)             | 56                        | 28                      | 32                 | 29.1              | 32                      |
| Power (mW)                   | 49.2*                     | 47*                     | 14.7               | 19.16             | 32                      |
| Power Efficiency (pJ/bit)    | 0.88                      | 1.68                    | 0.46               | 0.66              | 1                       |
| 1-UI Jitter Tol. Freq. (MHz) | 0.5                       | 0.6                     | 2                  | 1.8               | 1                       |
| Loop BW (MHz)                | 10                        | 11                      | 10                 | 12                | 10                      |

\*Only including CDR portion for fair comparison.

Table 1.1 shows the state-of-the-art PAM4 CDRs ([8], [9], [10], [11], and [12]) and summarizes their power, jitter tolerance, and loop bandwidth. For the phase detector (PD) design, they all require a set of slicers or a high-accuracy ADC, which means higher power consumption, heavy calibration, or both. In essence, these designs all transform an input analog PAM4 signal into a set of digital NRZ signals via slicers or an ADC. This shared approach is an important reason why these designs have similar power efficiency and jitter tolerance.
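The power-efficiency row of Table 1.1 follows directly from the power and data-rate rows, since 1 mW divided by 1 Gb/s equals 1 pJ/bit. The sketch below recomputes it from the table values; the dictionary layout and function name are illustrative, and the last line simply applies the same formula to the 8 mW and 56 Gb/s figures quoted for this work.

```python
# (data rate in Gb/s, power in mW, efficiency in pJ/bit as listed in Table 1.1)
designs = {
    "Roshan-Zamir, JSSC 2019": (56.0, 49.2, 0.88),
    "Aurangozeb, JSSC 2018": (28.0, 47.0, 1.68),
    "Zhang, JSSC 2020": (32.0, 14.7, 0.46),
    "Zhao, CICC 2020": (29.1, 19.16, 0.66),
    "Kwon, TCS-II 2019": (32.0, 32.0, 1.0),
}


def efficiency_pj_per_bit(power_mw: float, rate_gbps: float) -> float:
    """mW / (Gb/s) = 1e-3 J/s / 1e9 bit/s = 1e-12 J/bit = pJ/bit."""
    return power_mw / rate_gbps


for name, (rate, power, listed) in designs.items():
    computed = efficiency_pj_per_bit(power, rate)
    print(f"{name}: {computed:.2f} pJ/bit (table: {listed})")

# Same formula applied to the prototype described in this thesis.
this_work = efficiency_pj_per_bit(8.0, 56.0)
print(f"This work (8 mW at 56 Gb/s): {this_work:.2f} pJ/bit")
```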

These reasons motivate us to design a new PD and realize a low-power, high-performance PAM4 CDR.

### 1.2 Thesis Organization

This thesis describes a 56-Gb/s PAM4 CDR prototype that consumes 8 mW. The prototype achieves a 160 MHz loop bandwidth and at least 1 unit interval (UI) jitter tolerance at 10 MHz. The 28 GHz output clock has a root mean square (rms) jitter of 547 fs. The loop bandwidth can vary from 25 MHz to 160 MHz.

Chapter 2 reviews basics of CDR and PD design.

Chapter 3 presents this design in detail. It shows the gradual progress of designing the proposed PD. It also demonstrates the thought process behind the proposed background analog offset cancellation scheme. Other parts of the CDR are also explained, including the charge pump (CP), voltage-controlled oscillator (VCO), clock generation circuits, and data extraction units.

Chapter 4 shows the experiment setup and measurement results.

Chapter 5 summarizes this dissertation.

# CHAPTER 2

# Background

### 2.1 NRZ Clock and Data Recovery

NRZ wireline communications have been used in all kinds of scenarios for decades, and various designs have been proposed and used. It is therefore worth studying how an NRZ CDR extracts the clock and recovers the input data.

#### 2.1.1 NRZ CDR Performance Parameters

A CDR is evaluated on metrics including, but not limited to, bit error rate (BER), jitter transfer, jitter tolerance, and output jitter.

BER is defined as the total number of error bits at the CDR output in a given amount of time. A related term is bit error ratio, which is defined as the total number of error bits over the total number of bits transmitted in a given amount of time. Both represent how well a CDR can extract data correctly, and we want them to be as low as possible.

[6] provides the following equation to estimate an NRZ CDR's BER:

$$\mathrm{BER} = Q\left(\frac{V_{\rm pp}}{2\sigma_{\rm n}}\right), \qquad (2.1)$$

where $V_{\rm pp}$ is defined as the peak-to-peak signal swing at the CDR input, $\sigma_{\rm n}$ is the rms value of the noise, and

$$Q(x) = \int_{x}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left(\frac{-u^2}{2}\right) \, du. \qquad (2.2)$$

Common communication standards require a BER of at most $10^{-12}$. Some standards with forward error correction coding can tolerate a BER of $10^{-6}$. To achieve a BER of $10^{-12}$, $V_{\rm pp}/2\sigma_{\rm n}$ should be at least 7 [6]. $V_{\rm pp}$ can be affected by the input eye opening, the CDR sampling point with respect to the input eye, the CDR's sampling speed or bandwidth, and so on. $\sigma_{\rm n}$ can be affected by the device noise of the CDR circuits, the CDR's input bandwidth, and so on. Therefore, the CDR design directly determines its BER.

Figure 2.1: Theoretical jitter transfer.
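The link between the ratio $V_{\rm pp}/2\sigma_{\rm n}$ and the BER in (2.1) can be evaluated numerically. The sketch below uses the standard identity $Q(x) = \tfrac{1}{2}\,\mathrm{erfc}(x/\sqrt{2})$; the function names and the example voltages are illustrative and not taken from the original text.

```python
import math


def q_function(x: float) -> float:
    """Gaussian tail probability Q(x), computed via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2))


def nrz_ber(v_pp: float, sigma_n: float) -> float:
    """BER estimate of Eq. (2.1): Q(V_pp / (2 * sigma_n))."""
    return q_function(v_pp / (2 * sigma_n))


for ratio in (5, 6, 7, 8):
    print(f"V_pp/(2*sigma_n) = {ratio}: BER ~ {q_function(ratio):.2e}")

# Hypothetical example: 140 mV peak-to-peak swing with 10 mV rms noise (ratio of 7).
print(f"Example BER: {nrz_ber(0.140, 0.010):.2e}")
```

A ratio of 7 gives a BER of roughly $1.3 \times 10^{-12}$, which is why it is quoted as the approximate threshold for the $10^{-12}$ target.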

Jitter transfer is measured against frequency. If the phase of the input data is modulated with an extremely low frequency, the CDR should be able to track it almost perfectly. That is, as zero crossing point and maximum eye-opening point shift slowly, the CDR should still sample input data at maximum eye opening. As the modulation frequency or jitter frequency increases, the CDR will be less able to track it. This CDR property is often characterized by loop bandwidth. Different communication standards will have different requirements regarding jitter transfer and loop bandwidth. Fig. 2.1 shows a theoretical jitter transfer plot, where the corner frequency  $\omega_o$  is determined by loop parameters.

Like jitter transfer, jitter tolerance is measured against frequency. At each jitter frequency, when the phase modulation amplitude is small, the CDR should still track it with ease and its BER should still meet the target. However, as the amplitude increases, it reaches a point where the BER rises above the limit. This amplitude, often expressed in $\text{UI}_{pp}$ (peak-to-peak), is the jitter tolerance amplitude at this jitter frequency. Fig. 2.2 shows a theoretical jitter tolerance plot. At lower frequencies, the CDR is able to track the input periodic jitter, and the lower the jitter frequency is, the easier it is for the CDR to track it. When the jitter frequency is high enough, the CDR is not able to track it, so the effective eye opening for the CDR decreases. The 0.5 comes from the assumption that the CDR is ideal and the horizontal eye opening is $0.5 \text{ UI}_{pp}$.

Figure 2.2: Theoretical jitter tolerance.
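The shape of a theoretical jitter tolerance curve can be sketched under a simplifying assumption that is not stated in the original text: the loop is modeled as first order, with jitter transfer $H(j\omega) = 1/(1 + j\omega/\omega_o)$. Requiring the untracked phase error to stay within the eye opening then bounds the input jitter by the eye opening divided by $|1 - H(j\omega)|$, which equals $\sqrt{1 + (\omega_o/\omega)^2}$ times the eye opening. The 0.5 $\text{UI}_{pp}$ floor follows the assumption described above; the function name is illustrative.

```python
import math


def jitter_tolerance_ui_pp(f_jitter_hz: float, f_corner_hz: float,
                           eye_opening_ui_pp: float = 0.5) -> float:
    """Tolerable jitter amplitude for an assumed first-order loop.

    eye_opening / |1 - H(jw)| with H = 1 / (1 + jw/w0) reduces to
    eye_opening * sqrt(1 + (f_corner / f_jitter)^2).
    """
    return eye_opening_ui_pp * math.sqrt(1 + (f_corner_hz / f_jitter_hz) ** 2)


f_corner = 10e6  # hypothetical corner frequency, for illustration only
for f in (0.1e6, 1e6, 10e6, 100e6, 1e9):
    print(f"{f / 1e6:8.1f} MHz: {jitter_tolerance_ui_pp(f, f_corner):.3f} UIpp")
```

Well below the corner the tolerance rises by 20 dB per decade as the jitter frequency falls, and well above it the curve flattens at the eye opening. A real loop with a higher-order filter would show a steeper low-frequency slope.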

Communication standards usually define a jitter mask like the one shown in Fig. 2.3 with data from [13]. It usually has requirements on corner frequency and tolerable jitter amplitude at high jitter frequency.

At any jitter frequency, the CDR should tolerate a minimum jitter amplitude as plotted on the jitter tolerance mask.

Output jitter is determined by (but not limited to) the following factors: input data jitter, VCO phase noise, VCO control voltage ripple, supply and substrate noise, and direct coupling [6]. Except for input data jitter, the other four factors are determined by the CDR design. There is a trade-off between the contributions of input data jitter and VCO phase noise to the total output jitter: the CDR loop acts as a low-pass filter for input data jitter and as a high-pass filter for VCO phase noise. If the CDR loop bandwidth is increased, the loop suppresses more VCO phase noise and lets through more input data jitter, and vice versa. For example, if the input data has little jitter, a wider bandwidth is the better choice since it suppresses more of the VCO phase noise.
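The bandwidth trade-off can be illustrated with a minimal sketch that again assumes a first-order loop, which is a simplification not made in the original text. Input jitter then sees the low-pass response $1/(1 + jf/f_{\rm BW})$ and VCO phase noise sees the complementary high-pass response. The two bandwidths are the ends of the 25 MHz to 160 MHz range quoted in Section 1.2; the 10 MHz evaluation frequency is arbitrary.

```python
def input_jitter_gain(f_hz: float, f_bw_hz: float) -> float:
    """Low-pass magnitude applied to input data jitter."""
    return abs(1 / (1 + 1j * f_hz / f_bw_hz))


def vco_noise_gain(f_hz: float, f_bw_hz: float) -> float:
    """High-pass magnitude applied to VCO phase noise."""
    s = 1j * f_hz / f_bw_hz
    return abs(s / (1 + s))


f = 10e6  # offset frequency at which both contributions are compared
for f_bw in (25e6, 160e6):
    print(f"BW = {f_bw / 1e6:.0f} MHz: "
          f"input jitter x{input_jitter_gain(f, f_bw):.3f}, "
          f"VCO noise x{vco_noise_gain(f, f_bw):.3f}")
```

Widening the bandwidth lowers the VCO noise gain at a given offset and raises the input jitter gain, which is the trade-off described above.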



Figure 2.4: A behavioral CDR block diagram.

#### 2.1.2 CDR Generic Architecture

Fig. 2.4 shows a behavioral CDR block diagram. In some designs, different units may share the same circuits. For example, the data extraction block may be realized within the PD block.

The PD decides whether the clock is early or late with respect to the phase of the input data. In an analog CDR, the CP produces positive or negative currents based on the PD output, and these currents flow into the loop filter. The loop filter's voltage changes the VCO's frequency to adjust the clock phase. Depending on the CDR architecture, the VCO's output clock might need to be divided or combined to create proper clocks to drive other blocks.



Figure 2.5: Alexander PD.

In an NRZ CDR, the data extraction block is often already included in the PD block, since the PD needs to extract data to make a phase decision. In PAM4 designs, the PD sometimes recovers only one bit of information, and the rest is recovered in the data extraction block.

#### 2.1.3 Alexander PD

Fig. 2.5 shows the Alexander PD proposed in [14].

$D_{\rm in}$ is the input data signal. CLK is the input clock signal driving the PD, and we want it to align correctly with respect to $D_{\rm in}$. Up and Down are signals sent to the VCO. If Up is on, the VCO's control voltage will increase, then the VCO's frequency will increase, hence CLK's frequency is increased, and vice versa.

Fig. 2.6 shows the waveforms of all labelled signals in the Alexander PD (Fig. 2.5). The flip-flop samples at CLK rising edge. B is the result of flip-flop sampling  $D_{in}$  at CLK rising edges, and E' is the result of flip-flop sampling  $D_{in}$  at CLK falling edges. A is B with one clock period delay, and E is E' retimed to be in phase with A and B. At any moment, B is the sampled result of current symbol. A is the sampled result of previous symbol. E is the sampled result of the edge (or transition) between A and B.

As we can observe, when CLK is late, Up is on whenever there is a transition, and falling edge samples coincide with subsequent rising edge samples. When CLK is early, Down is on whenever there is a transition, and falling edge samples coincide with previous rising edge



Figure 2.6: Alexander PD waveforms.

samples. When CDR is locked, CLK will be aligned to where E' is sampled in the middle of transitions. A or B contains the extracted data. Up and Down will average each other out in this condition.

Alexander PD is a bang-bang PD. No matter how late CLK is with respect to $D_{in}$, the Up pulse is always one CLK period wide for one transition. Similarly, no matter how early CLK is with respect to $D_{in}$, the Down pulse is always one CLK period wide for one transition as well. Hence, the average PD output (Up-Down) does not change as the phase error (how early or late CLK is) changes, as shown in Fig. 2.7 for an ideal Alexander PD.
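The bang-bang behavior can be sketched in a few lines. The assignment of the two XOR outputs to Up and Down below is inferred from the late/early behavior described in the text, and the `edge_sample` helper is a hypothetical ideal model (zero rise time, no noise), not the circuit of Fig. 2.5 itself.

```python
def alexander_pd(a, e, b):
    """Return (up, down) for previous data a, edge sample e, current data b (each 0 or 1)."""
    up = a ^ e    # edge already equals the new symbol: clock is late
    down = e ^ b  # edge still equals the old symbol: clock is early
    return up, down

def edge_sample(a, b, phase_error):
    """Ideal edge sample (assumption): a late clock (phase_error > 0) sees the new symbol b."""
    return b if phase_error > 0 else a

def pd_output(a, b, phase_error):
    up, down = alexander_pd(a, edge_sample(a, b, phase_error), b)
    return up - down

# For a 0 -> 1 transition the output depends only on the sign of the phase error.
print([pd_output(0, 1, err) for err in (0.05, 0.2, 0.4)])
print([pd_output(0, 1, err) for err in (-0.05, -0.2, -0.4)])
```

Without a transition (`a == b`) both outputs are zero, so the PD produces no correction.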

Fig. 2.7 reveals the "bang-bang" property of Alexander PD: the PD output only shows the sign of the phase error (1 or -1). Due to limited input bandwidth, limited sampling aperture, metastability, noise and so on, actual Alexander PD transfer characteristic looks



Figure 2.7: Ideal Alexander PD transfer characteristic.

more linearized around zero phase error [15].

Alexander PD is also called a 2x-oversampled PD. For every symbol transmitted, a data sample is being made, and an edge sample is being made as well. Hence there are two samples for each symbol.

As Fig. 2.6 shows, for every new data symbol, there is a rising CLK edge. Therefore, if the data rate is 10 Gb/s, CLK should be 10 GHz. Therefore, Fig. 2.5 is also called a full-rate phase detector. It is full rate in that CLK frequency is equal to the data rate.

Sometimes it may be difficult to implement a full-rate Alexander PD since it is not always easy or power-consumption friendly to generate a full-rate clock signal. Therefore, it is beneficial to realize Alexander PD with a lower clock frequency.

Fig. 2.8 shows the timing essence of realizing an Alexander PD with half-rate clocks. The key modification is that quadrature half-rate clocks $CLK_{\rm I}$ and $CLK_{\rm Q}$ are used. Let the data rate be 10 Gb/s; the time difference between a data sample and its adjacent edge sample is then 50 ps. For a 5 GHz clock, a 90° phase difference translates to 50 ps. In Fig. 2.8, $CLK_{\rm I}$ samples data points at both edges, and $CLK_{\rm Q}$ samples edge points at both edges as



Figure 2.8: Half-rate Alexander PD timing.



Figure 2.9: Hogge PD.

well. This makes sure that for every 50 ps, the PD samples  $D_{\rm in}$ . Corresponding outputs need to be retimed to generate correct signals for the VCO control loop.

Observing  $D_{in}$  in Fig. 2.8 and Fig. 2.6 reveals that whether it is a half-rate or full-rate Alexander PD, it is still a 2x-oversampled PD: one data sampled, then one subsequent edge sampled, so on and so forth. Same is true for even lower rate PDs, such as quarter rate and so on.

#### 2.1.4 Hogge PD

Fig. 2.9 shows the Hogge PD proposed in [16]. A distinction between Hogge PD and Alexander PD (Fig. 2.5) is that in Hogge PD  $D_{in}$  drives a logic gate which will present stringent bandwidth and speed requirement for this logic gate.



Figure 2.10: Hogge PD waveforms.

Fig. 2.10 shows how Hogge PD decides whether CLK is early or late with respect to  $D_{\rm in}$ . *A* is *B* delayed by half period of CLK ( $T_{\rm CLK}/2$ ). Therefore, whenever there is a transition, *Reference* will always generate a pulse of  $T_{\rm CLK}/2$  width. The width of the *Error* pulse depends on the time difference between  $D_{\rm in}$ 's zero-crossing transition and CLK rising edge.

When *Error* is on, a constant current will go to VCO control loop, thus increasing its frequency. When *Reference* is on, a constant current will be drawn from VCO control loop, thus decreasing its frequency.

Ideally, when CLK's rising edge happens $T_{CLK}/2$ later than $D_{in}$'s zero-crossing transition, the Error pulse and the Reference pulse will have equal width, hence this is where a Hogge-PD CDR is locked. Typically, $T_{CLK}/2$ after the zero-crossing transition gives the optimal eye opening for data extraction, and A or B contains the extracted data.

Fig. 2.11 shows an ideal Hogge-PD transfer characteristic. The linearity stems from the linearity between *Error* pulse width and the amount of phase error. Thus Hogge PD is a linear PD. Compared with Fig. 2.7, Hogge PD not only shows the sign of the phase error, but also the magnitude. However, this addition of information comes with a price: Alexander



Figure 2.11: Ideal Hogge PD transfer characteristic.

PD design mostly worries about designing a high-speed sampling flip-flop, while Hogge PD in addition needs to worry about processing input  $D_{in}$  with a logic circuit directly.
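The linear characteristic follows directly from the pulse widths. The sketch below is an idealized model (zero rise/fall time, ideal XOR gates); the function name and the choice of picoseconds as the unit are assumptions for illustration.

```python
def hogge_net_width(t_clk, delay):
    """Net (Error - Reference) pulse width per transition.

    delay: time from the D_in transition to the next CLK rising edge.
    """
    error_width = delay          # Error is high from the D_in transition until CLK samples it
    reference_width = t_clk / 2  # Reference is always half a clock period wide
    return error_width - reference_width

T_CLK = 100.0  # ps, e.g. a 10-GHz full-rate clock
for delay in (30.0, 40.0, 50.0, 60.0, 70.0):
    print(delay, hogge_net_width(T_CLK, delay))
```

The net width is zero at a delay of $T_{CLK}/2$ and changes in proportion to the phase error on either side, unlike the sign-only output of a bang-bang PD.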

Hogge PD can also be thought of as a 2x-oversampled PD. Observing Fig. 2.10, the PD output is decided by *Error*'s pulse width. The rising edge of *Error* comes from $D_{in}$'s data-transition edge, and the falling edge of *Error* comes from *CLK*'s rising edge sampling $D_{in}$. The former corresponds to E' and E in Fig. 2.5, and the latter corresponds to A and B in Fig. 2.5.

As Fig. 2.10 reveals, *CLK* frequency is the same as the data rate, therefore, Hogge PD is a full-rate PD. Just like Alexander PD can be modified into a half-rate bang-bang PD, Hogge PD can be modified into a half-rate linear PD as well (Fig. 2.12, proposed in [17]).

The half-rate linear PD in [17] uses four latches instead of two flip-flops in [16].  $L_1$  followed by  $L_2$  is equivalent to one flip-flop, and  $L_3$  followed by  $L_4$  is equivalent to one flip-flop as well.  $L_1$ - $L_2$  flip-flop samples  $D_{in}$  at CLK's rising edges, and  $L_3$ - $L_4$  flip-flop samples  $D_{in}$  at CLK's falling edges. Since CLK is half-rate here,  $B \oplus D$  is equivalent to  $A \oplus B$  in



Figure 2.12: Half-rate linear PD.

Fig. 2.9, generating a constant-width reference pulse whenever there is a transition. For every half *CLK* cycle, *A* and *C* follow $D_{in}$ since $L_1$ and $L_3$ are latches. Therefore, $A \oplus C$ is equivalent to $D_{in} \oplus B$ in Fig. 2.9.

The full-rate Hogge PD in Fig. 2.9 and the half-rate linear PD in Fig. 2.12 can also be regarded as 2x-oversampled PDs. To generate *Reference* signals, data points need to be sampled (this is also the bare minimum for any CDR since data recovery is always required). Whenever a transition happens, the zero-crossing point determines the *Error* signal, so in a manner of speaking, edge points are "sampled" as well.

#### 2.1.5 Baud-Rate PD

All the PDs introduced so far need to "extract" some extra information besides the necessary data samples. For Alexander PD, edge samples need to be made. For Hogge PD, $D_{\rm in}$ needs to pass through an XOR gate directly or indirectly. Naturally, a question is raised: can we extract the clock and align the phase just by extracting what we really need, the data?

Fig. 2.13 shows input data signal  $D_{in}$  and clock signal CLK with a frequency equal to the baud rate. At each CLK rising edge, the CDR samples  $D_{in}$  hoping to recover the correct symbol. Ideally, CLK should be aligned to where  $D_{in}$  has the maximum eye opening.



Figure 2.13: Baud-rate PD exemplary waveforms.

Observing Fig. 2.13 may help us create a baud-rate PD.

In Fig. 2.13(a), the $D_{\rm in}$ pattern is "-1, 1, 1, -1". *CLK* samples $D_{\rm in}$ later than the optimal phase point; therefore, the normalized amplitude at the *CLK* rising edge may not reach its maximum (-1 or 1). Here, $d_1 = -0.8$, $d_2 = 1$, $d_3 = 0.8$, and $d_4 = -1$. Consider the transition from $d_1$ to $d_2$: the reason $d_1$ is -0.8 instead of -1 is that *CLK* is late. By observing the absolute amplitudes of $d_1$, $d_2$, $d_3$, and $d_4$, we can see that there is a 0.2 difference between neighboring samples. So, the question becomes: can we take advantage of this and extract the phase information?

In Fig. 2.13(a), there are three transitions: $d_1$ to $d_2$, $d_2$ to $d_3$, and $d_3$ to $d_4$. In each of these transitions, one amplitude is 1, and the other is 0.8. However, sometimes it is the first sample that is 0.8, and sometimes it is the second sample that is 0.8. This is because sometimes it is a -1 to 1 transition and sometimes it is a 1 to -1 transition. Therefore, to extract phase information, we need to know the signs of the samples. To extract the sign, we can pass sample $d_k$ through a quantizer to get the sign $q_k$. Let us look at the equation below to see if it could be a candidate for a baud-rate PD.

$$p_{k} = d_{k} \times q_{k-1} - d_{k-1} \times q_{k}. \tag{2.3}$$

Equation (2.3) can extract phase information for Fig. 2.13(a). We can verify this by applying the numbers:

$$p_2 = d_2 \times q_1 - d_1 \times q_2 = 1 \times (-1) - (-0.8) \times 1 = -0.2, \tag{2.4}$$

$$p_3 = d_3 \times q_2 - d_2 \times q_3 = (0.8) \times 1 - 1 \times 1 = -0.2, \tag{2.5}$$

$$p_4 = d_4 \times q_3 - d_3 \times q_4 = (-1) \times 1 - 0.8 \times (-1) = -0.2. \tag{2.6}$$

A value of -0.2 indicates CLK is late and by a level of 0.2. For the PD to work properly, (2.3) also needs to produce correct outputs when CLK is early as in Fig. 2.13(b). We can verify it by plugging in numbers again:

$$p_6 = d_6 \times q_5 - d_5 \times q_6 = 0.8 \times (-1) - (-1) \times 1 = 0.2, \tag{2.7}$$

$$p_7 = d_7 \times q_6 - d_6 \times q_7 = 1 \times 1 - 0.8 \times 1 = 0.2, \tag{2.8}$$

$$p_8 = d_8 \times q_7 - d_7 \times q_8 = (-0.8) \times 1 - 1 \times (-1) = 0.2. \tag{2.9}$$

Therefore, (2.3) can process Fig. 2.13(b) as well. However, in Fig. 2.13(c) and Fig. 2.13(d), the pattern is "-1, 1, -1, 1", and (2.3) does not work properly:

$$p_{10} = d_{10} \times q_9 - d_9 \times q_{10} = 0.8 \times (-1) - (-0.8) \times 1 = 0, \tag{2.10}$$

$$p_{12} = d_{12} \times q_{11} - d_{11} \times q_{12} = 0.8 \times (-1) - (-0.8) \times 1 = 0. \tag{2.11}$$

This indicates that a baud-rate PD usually requires a more complicated scheme or algorithm, which distinguishes it from the previously introduced PDs.
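The arithmetic in (2.4) through (2.11) can be reproduced with a short script. The sample values are the ones quoted in the text for Fig. 2.13; the helper names and the rounding to six decimals (to hide floating-point residue) are choices made for this sketch.

```python
def sign(x):
    """Ideal quantizer: returns q_k = +1 or -1."""
    return 1 if x >= 0 else -1

def baud_rate_pd(d):
    """Evaluate p_k = d_k * q_{k-1} - d_{k-1} * q_k, Eq. (2.3), for consecutive samples."""
    q = [sign(x) for x in d]
    return [round(d[k] * q[k - 1] - d[k - 1] * q[k], 6) for k in range(1, len(d))]

late = [-0.8, 1.0, 0.8, -1.0]    # d_1..d_4 in Fig. 2.13(a)
early = [-1.0, 0.8, 1.0, -0.8]   # d_5..d_8 in Fig. 2.13(b)
alternating = [-0.8, 0.8]        # d_9, d_10 in Fig. 2.13(c)

print(baud_rate_pd(late))
print(baud_rate_pd(early))
print(baud_rate_pd(alternating))
```

The first two lists carry a consistent sign, while the alternating pattern yields zero, which is the failure case noted above.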

[18] proposed algorithms and implementations for baud-rate PD. Fig. 2.14 is an implementation of a simple baud-rate PD from [18]. Average of  $y_k$  gives the phase information. The mathematical representation of Fig. 2.14 is (2.3).



Figure 2.14: Muller-Muller PD.

The core idea of the PD in [18] is based on the pulse response the CDR receives. Suppose the transmitter sends out a pulse; we would then like the CDR to sample the received pulse near its maximum opening, getting value $h_0$. During the previous clock cycle, the CDR gets value $h_{-1}$, and during the cycle after $h_0$ it gets $h_1$. Ideally, $h_{-1} = h_1$ if *CLK* is sampling at the maximum opening. If *CLK* is early, $h_{-1} < h_1$. If *CLK* is late, $h_{-1} > h_1$.

Therefore, the key is to extract  $h_{-1}$  and  $h_1$  from the input symbols and various algorithms are proposed. Given a pulse response of  $h_n$ , the signals received by the CDR can be represented by:

$$d_{k} = \sum_{n} D_{n} \times h_{k-n}, \tag{2.12}$$

where  $D_n$  denotes the symbols sent by the transmitter.

To extract  $h_{-1}$  and  $h_1$  from  $d_k$ , received signals  $d_k$  need to be processed along with data dependent coefficients. One example of such coefficient is quantized result of  $d_k$ ,  $q_k$  as in Fig. 2.14.
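To see how the quantized coefficients expose $h_{-1}$ and $h_1$, the sketch below averages (2.3) over every possible NRZ symbol sequence, using (2.12) to form the received samples. The three-tap pulse response values are hypothetical, and the decisions $q_k$ are assumed to equal the transmitted symbols $D_k$ (no decision errors).

```python
from itertools import product

# Hypothetical sampled pulse response: pre-cursor h[-1], main cursor h[0], post-cursor h[1].
h = {-1: 0.1, 0: 1.0, 1: 0.3}

def received(D, k):
    """d_k = sum_n D_n * h_{k-n}, Eq. (2.12); D maps symbol index -> +1/-1."""
    return sum(D[n] * h[k - n] for n in D if (k - n) in h)

def average_pd_output():
    """Average p_0 = d_0*q_{-1} - d_{-1}*q_0 over all sequences D_{-2}..D_{1}."""
    total = 0.0
    count = 0
    for symbols in product((-1, 1), repeat=4):
        D = dict(zip((-2, -1, 0, 1), symbols))
        d_0 = received(D, 0)    # depends on D_{-1}, D_0, D_1
        d_m1 = received(D, -1)  # depends on D_{-2}, D_{-1}, D_0
        total += d_0 * D[-1] - d_m1 * D[0]  # q_k assumed equal to D_k
        count += 1
    return total / count

print(average_pd_output(), h[1] - h[-1])
```

Under these assumptions the data-dependent terms average out and the mean output equals $h_1 - h_{-1}$, which is positive when $h_{-1} < h_1$ (clock early) and negative when $h_{-1} > h_1$ (clock late).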



Figure 2.15: Analog CDR loop filters.



Figure 2.16: (a) LC VCO, (b) Ring VCO.

### 2.1.6 Other CDR Components

Besides PD, other CDR components are also critical to CDR's performance. In Fig. 2.4, there are also: *Loop Filter*, *VCO*, *Clocking*, and *Data Extraction*.

In an analog CDR, the most basic form of *Loop Filter* is a resistor in series with a capacitor as in Fig. 2.15(a). A common loop filter choice is Fig. 2.15(b). Fig. 2.15(a) is often used in analog bang-bang-PD CDR circuits, in which $R_1$ acts like a proportional branch and $C_1$ acts like an integral branch [19]. Fig. 2.15(b) is often used in analog linear-PD CDR circuits, where the loop filter $(R_1, C_1, C_2)$ decides the response of the CDR models [20]. *Loop Filter* transforms the output of the PD into the input of the VCO; therefore, in a digital CDR architecture, the loop filter is often a DSP unit.

Fig. 2.16 shows two types of popular VCO designs. LC VCO usually has larger area (due to the large inductor area) but lower phase noise.

Clocking and Data Extraction in a CDR manifest themselves in different ways. In a full-rate CDR, no extra steps are needed to generate the necessary clocks, while for a quarter-rate CDR with a full-rate VCO, a divide-by-four clock divider circuit is required. As for Data Extraction, this unit is often inherent in the PD, especially for an NRZ CDR. When aligned, for an Alexander PD, B has the extracted data in Fig. 2.5. Similarly, the flip-flops in Fig. 2.9 have the extracted data for a Hogge PD.

# 2.2 PAM4 Clock and Data Recovery

For PAM4 clock and data recovery circuits, one of the most challenging tasks is how to decide early or late given a PAM4 signal. In NRZ CDR circuits, it is often assumed that signals behave like digital signals (0 or 1), as revealed by the digital circuits such as XOR gates and flip-flops in Fig. 2.5, Fig. 2.9, Fig. 2.12, and Fig. 2.14. This is also exemplified in the waveforms demonstrating how these PDs work, as we think of these waveforms as taking values of 0 or 1.



Figure 2.17: A 4-level PAM4 eye and the three levels of slicers needed.

On the other hand, data recovery is very different for PAM4 signals. Assuming the CDR loop is locked ideally, for each data symbol period the data extraction block will receive one of the four analog levels in a PAM4 signal. Since each PAM4 symbol contains 2 bits of information, an "ADC" is necessary to convert a 2-bit symbol into 2 bits of digital signals. A minimal way of realizing this "ADC" is to use a combination of three slicers, which is equivalent to a flash ADC. Assuming the four analog levels in a PAM4 signal are 3, 1, -1, and -3, the necessary slicer levels will be 2, 0, and -2 as shown in Fig. 2.17.



Figure 2.18: A PAM4 signal transformed into digital signals.

Following Fig. 2.17, Fig. 2.18 illustrates the basic ideas for PAM4 data extraction. Ideally, one stream of PAM4 input signal is transformed into three streams of thermometer (thermal) codes by three ideal comparators (where slicers are used in real circuit designs). Therefore, each PAM4 symbol corresponds to three thermal-code bits. These three bits can be combined into two bits of digital signal via a certain coding scheme, such as Gray code. For example, when a PAM4 level 1 is sent, the thermal code is 011 (corresponding to slicers 2, 0, and -2), and the resulting Gray code is 11.
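The slicer-based data extraction can be sketched as follows. Only the mapping of level 1 to thermal code 011 and Gray code 11 is taken from the text; the remaining Gray assignments are an assumed (commonly used) mapping, and the function names are hypothetical.

```python
SLICER_LEVELS = (2, 0, -2)  # order of digits: +2 slicer, 0 slicer, -2 slicer

def thermal_code(v):
    """Three ideal comparators: each outputs 1 when the input is above its threshold."""
    return "".join("1" if v > t else "0" for t in SLICER_LEVELS)

# Assumed Gray mapping; adjacent PAM4 levels differ in exactly one bit.
GRAY = {"000": "00", "001": "01", "011": "11", "111": "10"}

def pam4_to_bits(v):
    return GRAY[thermal_code(v)]

for level in (3, 1, -1, -3):
    print(level, thermal_code(level), pam4_to_bits(level))
```

With a Gray mapping, a slicer error between two adjacent levels corrupts only one of the two recovered bits.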

Muller-Muller PD can be readily modified into a PAM4 PD ([21], [22], and [23]), since it only uses extracted data and such data are always available in digital form, as explained above. The algorithm in such a PD needs to account for PAM4 data formats. For example, Equation (2.3) might need to support three $d$s and three $q$s if thermal-code results are sent to a Muller-Muller PD.



Figure 2.19: (a) Ideal Hogge PD waveforms, (b) A more realistic representation.

Just as the traditional Muller-Muller PD, which was designed for NRZ communications, can be modified to support PAM4, a question arises: could other types of NRZ PD be made into PAM4 PDs?

Modifying a Hogge PD to let it support PAM4 is a very difficult task. First, let us dig more into NRZ Hogge PD. In Fig. 2.10, the onset of *Error* pulse is when the rising/falling edge of  $D_{\rm in}$  reverses the output of its XOR gate. Ideally, this  $D_{\rm in}$  should be rail-to-rail and have a rise/fall time of almost zero. In reality, due to channel loss and circuit bandwidth limitations, this  $D_{\rm in}$  has a very limited swing and large rise/fall time (limited vertical eye opening).

Fig. 2.19 demonstrates this effect. Fig. 2.19(a) shows the waveforms in an ideal Hogge PD (Fig. 2.9). Rising edges of CLK' are aligned at the middle points of each symbol period of $D_{\rm in}$. Therefore, the *Error* pulse has the same amplitude and pulse width as the *Reference* pulse.

Due to channel loss and other effects,  $D_{in}$  will have a significantly large rise/fall time. On the other hand, due to bandwidth limitations of the circuits, A, B, Error, and Reference also have significant rise/fall time. The amplitude of *Error* and *Reference* also depend on their inputs,  $D_{in}$ , A, and B, especially for high-speed CDR circuits.

Fig. 2.19(b) demonstrates the effects of the above factors. Still at 0 phase error, the *Error* pulse now is significantly smaller than the *Reference* pulse in that it has a shorter duration and smaller amplitude. The *Reference* pulse changes little from (a) to (b) in Fig. 2.19 since *A* always lags *B* by half a clock period, and their swings are usually large. The *Error* pulse, however, changes significantly from (a) to (b) in Fig. 2.19 due to $D_{in}$. The onset of the *Error* pulse is significantly delayed due to the rise/fall time of $D_{in}$, but the end of the *Error* pulse does not extend equally to compensate for this delay due to the relatively high speed of *B*. The amplitude of *Error* is also smaller due to the limited swing of $D_{in}$.

The above analysis shows that for an NRZ Hogge PD, $D_{in}$ deteriorates the PD performance directly. It will introduce unwanted phase error or even render the PD useless when the *Error* pulse is too small.

If the NRZ input of a Hogge PD is replaced with a PAM4 input, the above issue becomes even more severe. Although a 3 to -3 and a -3 to 3 transition resemble an NRZ transition, other transitions will make the Hogge PD produce different results. For example, it will mostly ignore transitions between 3 and 1 and between -1 and -3. A 3 to -1 transition will create a *Reference* pulse similar to that of an NRZ transition, but the corresponding *Error* pulse will be even smaller than that in Fig. 2.19(b) since -1 represents an even smaller swing.

Alexander PD or bang-bang PD, however, can be readily modified into a PAM4 PD, since a bang-bang PD essentially compares an edge sample with two neighboring data samples as shown in Fig. 2.6 (where A and B represent data samples and E represents the edge sample). A PAM4 CDR must have three slicers or an ADC to sample the data points of the input signal to realize data extraction. If the same is applied to the edge points of the input signal, then this PAM4 CDR will have data sample representation as well as edge sample representation in digital form.

As illustrated in Fig. 2.20, a PAM4 input is sampled at both data points ($d_1$, $d_2$, and $d_3$) and edge points ($e_1$ and $e_2$) by three slicers. Therefore, after the slicers, later circuit stages



Figure 2.20: Sampling both data points and edge points with slicers.



Figure 2.21: Combining results from three PD units.

in the CDR see 111 for $d_1$, 011 for $e_1$, 000 for $d_2$, 001 for $e_2$, and 011 for $d_3$ (in which the first digit is the result of the +2 slicer, the second the 0 slicer, and the third the -2 slicer). Therefore, the challenge becomes: given this digital information, how can a PD decide if the clock is early or late?

A simple answer would be: what if we repeat an NRZ bang-bang PD unit (for example Fig. 2.5) three times and apply the corresponding slicer outputs to each unit? Applying this to Fig. 2.20, for the $d_1$-$e_1$-$d_2$ transition, the +2-slicer bang-bang PD decision is *late*, the 0-slicer decision is *early*, and the -2-slicer decision is *early*. For $d_2$-$e_2$-$d_3$, the decisions are *nil*, *early*, and *late*. Therefore, different decisions are made in this process. However, the eyeballing answer is that the clock is *early*. So clearly this simple solution needs to be improved. The decisions of all three units need to be combined together to produce an output to the loop filter and



Figure 2.22: A simple method to combine results from three PD units.

VCO as shown in Fig. 2.21.
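The per-slicer decisions quoted for Fig. 2.20 can be checked with a small sketch. The slicer codes are the ones given in the text; the function names and the string labels are hypothetical.

```python
def bang_bang(a, e, b):
    """NRZ bang-bang decision from previous data a, edge sample e, and next data b."""
    if a == b:
        return "nil"                       # no transition seen by this slicer
    return "late" if e == b else "early"   # edge equals new symbol -> late

def per_slicer_decisions(d_prev, edge, d_next):
    """Apply one bang-bang unit per slicer; codes are strings ordered +2, 0, -2."""
    return [bang_bang(a, e, b) for a, e, b in zip(d_prev, edge, d_next)]

# Slicer outputs from Fig. 2.20: d1 = 111, e1 = 011, d2 = 000, e2 = 001, d3 = 011.
print(per_slicer_decisions("111", "011", "000"))
print(per_slicer_decisions("000", "001", "011"))
```

Both transitions produce conflicting per-slicer decisions, which is why the three results cannot simply be used individually.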

One of the simplest methods to combine these three PD units is to let each PD drive its own charge pump and tie these three charge pumps together [24]. However, such a simple design faces a significant challenge as shown in Fig. 2.22.

Fig. 2.22 shows how this design will behave when the input data symbol transitions from 3 to -1 and there is no phase error between input data and sampling clock. Ideally, since there is no phase error, zero current should go to the loop filter and the VCO.

The level-2-comparator PD unit sees a high-to-low transition with a low edge point, so it decides that the clock is late, and its corresponding CP unit produces a positive current. The level-0-comparator PD unit also sees a high-to-low transition, but its edge point is high, so it decides that the clock is early, and its corresponding CP produces a negative current instead. The -2 unit sees no transition, so no output is generated.

Ideally, the two currents from the level-2 and level-0 units should cancel each other out. However, due to mismatch, a residual current will go to the loop filter and the VCO's frequency will change. This effect will introduce significant jitter generation and make the loop lock to a less than optimal phase, hence reducing jitter tolerance and increasing BER. This effect also exists for transitions between 3 and -3, as the currents from the 2 unit and the -2 unit do not cancel each other out perfectly.



Figure 2.23: Three types of transitions in PAM4: (a) major transitions, (b) middle transitions, and (c) minor transitions.

The above analysis shows that, for a PAM4 PD that is based on comparators/slicers/ADC and NRZ Alexander/bang-bang PD, different PAM4 transitions have different effects on the output to the loop filter and the VCO. PAM4 transitions can be classified into three categories as shown in Fig. 2.23. Major transitions are defined as transitions that cross three slicer levels, middle transitions are defined as those that cross two slicer levels, and minor transitions are defined as those that cross one slicer level. For Fig. 2.22, only minor transitions avoid aforementioned issues. Therefore, to combine the results of the three units, a more sophisticated scheme is often required.

A popular method to realize combining is majority voting, which makes one decision regarding whether clock is early or late based on three decisions corresponding to three slicers and uses one charge pump unit as shown in Fig. 2.24. Such method can be found in [25], [26], [12], [27].

With majority voting, when there is a 3 to -1 transition as in Fig. 2.22, since the decisions are *late*, *early*, and *nil* respectively, the final decision is simply that there is no decision and the charge pump unit produces no current.
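A minimal sketch of such a voting step is shown below. It assumes the three per-slicer decisions are already available as labels and treats a tie between *late* and *early* as no decision, as in the 3 to -1 example; the cited designs may implement the vote differently.

```python
VOTE = {"late": 1, "early": -1, "nil": 0}

def majority_vote(decisions):
    """Combine three per-slicer decisions into one decision for a single charge pump."""
    score = sum(VOTE[d] for d in decisions)
    if score > 0:
        return "late"
    if score < 0:
        return "early"
    return "nil"  # tie or no transition: the charge pump produces no current

print(majority_vote(["late", "early", "nil"]))    # 3 to -1 transition with no phase error
print(majority_vote(["late", "early", "early"]))
```

Because only one charge pump is driven, the result no longer depends on matching between separate charge-pump currents.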

To realize majority voting, extensive digital logic is often required whether it is an analog CDR or digital CDR. On the other hand, when there is a minor transition, the decision of the PD is very sensitive to comparator/slicer offset. Therefore, it is also a good idea to realize a PAM4 PD with only level-0 comparator/slicer instead of all three. The intuition behind



Figure 2.24: Majority Voting.

such a method is: if a PAM4 input signal only transitions between 3 and -3 or between 1 and -1, then it is equivalent to an NRZ input. So the challenge becomes: how to design a PD that only sees transitions between 3 and -3 and/or between 1 and -1?

In [28], level-0 slicer output goes to an Alexander PD as in Fig. 2.5. However, outputs of +2-slicer and -2-slicer are used to enable this PD only if the transition is between 3 and -3 or between 1 and -1.

In [29] and [10], the basis of the PD can be illustrated in Fig. 2.25.

Suppose the signal input to a CDR is NRZ. When the clock is aligned, the data sample should be in the middle of the symbol (not shown in Fig. 2.25), and the edge sample should ideally be in the middle of the transition, which is where the differential input signal crosses 0. When there is a phase error, the clock does not sample the middle of the transition and the resulting edge sample voltage is not 0.

Fig. 2.25(a) shows a low to high transition when the clock is late. As a result, the sampled edge point voltage, $V_{\text{phase}}$, is a positive voltage. The more the clock lags the input, the larger the amplitude of $V_{\text{phase}}$ is. Therefore, in this scenario, the positive polarity of $V_{\text{phase}}$ shows that the clock is late, and the amplitude of $V_{\text{phase}}$ shows how late the clock is.

Fig. 2.25(b) shows what  $V_{\text{phase}}$  becomes when the clock is early for a low to high transition. The polarity becomes negative since the clock is early instead of late.



Figure 2.25: Deciding early or late based on sampled edge voltage.

To realize a PAM4 PD with such an idea, the challenge is how to select the correct transitions out of all transitions shown in Fig. 2.23. In Fig. 2.23, major transitions are symmetrical about 0 so they can be used. Middle transitions are not symmetrical about 0 so they should be discarded. As for minor transitions, the one that crosses the +2 level and the one that crosses the -2 level do not cross 0, so they should be discarded. However, the minor transitions that cross 0 are symmetrical about 0, so they can be used. So two out of six transition types are potential candidates for this scheme.

[29] uses both types and [10] uses only major transitions.
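The transition categories and the selection of usable transitions can be enumerated directly. The sketch below treats a transition type as an unordered pair of PAM4 levels, so each pair covers both directions; the helper names are hypothetical.

```python
from itertools import combinations

LEVELS = (3, 1, -1, -3)
SLICERS = (2, 0, -2)

def classify(a, b):
    """Major, middle, or minor, by the number of slicer levels the transition crosses."""
    lo, hi = min(a, b), max(a, b)
    crossed = sum(1 for t in SLICERS if lo < t < hi)
    return {3: "major", 2: "middle", 1: "minor"}[crossed]

def symmetric_about_zero(a, b):
    """True when the midpoint of the transition lies exactly at 0."""
    return a == -b

transition_types = list(combinations(LEVELS, 2))  # six unordered level pairs
for a, b in transition_types:
    print((a, b), classify(a, b), symmetric_about_zero(a, b))

usable = [pair for pair in transition_types if symmetric_about_zero(*pair)]
print(usable)
```

Only the pair between 3 and -3 (major) and the pair between 1 and -1 (minor) remain, matching the two candidate types identified above.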

Other NRZ PD designs can also be modified into PAM4 PD designs. [30] proposes an NRZ PD which realizes phase detection by mixing the sampling clock with input data transition pulses. These transition pulses are generated by passing both the input data signal and its delayed version to an XOR gate so that whenever there is a transition a pulse is generated. Therefore, the timing of this NRZ PD depends on the zero-crossing of the input data signal.

If a PAM4 signal is applied to this PD, major transitions and minor transitions that cross zero in Fig. 2.23 provide correct timing, but the middle transitions provide incorrect timing since their zero-crossings deviate from the middle of the transition. Since these deviations cancel each other out in the long run, this PD is able to provide correct phase information given a PAM4 signal.

# CHAPTER 3

# Design of a 56-Gb/s 8-mW PAM4 CDR with High Jitter Tolerance

# 3.1 Overview of the Architecture

Fig. 3.1 shows an overview of the architecture to highlight a few key points of the proposed circuit:

- 1. One-eighth rate is applied to *PD* blocks, *CP* blocks, as well as the *Data Extraction* part (which is composed of multiple data extraction blocks). Therefore, these blocks are driven by 3.5-GHz clocks. Eight *PD* blocks are required to do phase detection and eight data extraction blocks (not shown here) are needed to extract the data.
- 2. The proposed *PD* detects all transitions and makes the phase decision itself instead of relying on information from the data extraction blocks, which is used only to turn off *CP* when there is no data transition.
- 3. For the *PD* block, a background offset-cancellation scheme is proposed and an offset-cancellation comparator for this scheme is developed.
- 4. The *CP* block receives two differential inputs from its *PD* block to generate proper currents for the *loop filter* and uses corresponding data extraction results to turn itself off when there is no data transition.
- 5. Outputs of the 28-GHz LC VCO go to the Clock Generation block, which creates proper 3.5-GHz Clocking with minimum power consumption.



Figure 3.1: Overview of the architecture.

# 3.2 Phase Detector Basis

#### 3.2.1 Euclidean Distance

To understand the operation and the intuition of the proposed phase detector, it is helpful to go back to the NRZ Alexander bang-bang PD and take a closer look.

Fig. 3.2 captures the essence of Alexander PD in Fig. 2.5 or a half-rate bang-bang PD in Fig. 2.8.  $V_{\rm A}$  and  $V_{\rm B}$  are the data samples and  $V_{\rm E}$  is the transition sample between  $V_{\rm A}$  and  $V_{\rm B}$ . Ideally, we want to sample  $V_{\rm A}$  and  $V_{\rm B}$  in the middle of the symbol period to get maximum vertical opening.

The XOR of $V_{\rm A}$ and $V_{\rm E}$ in Fig. 3.2 corresponds to the XOR of A and E in Fig. 2.5, and the XOR of $V_{\rm E}$ and $V_{\rm B}$ in Fig. 3.2 corresponds to the XOR of E and B in Fig. 2.5. Alexander PD detects phase by deciding if the edge sample is the same as the previous data sample or the subsequent data sample.



Figure 3.2: Alexander PD detects (a) clock late, and (b) clock early.

Fig. 3.2(a) is a low to high transition when the clock is late. Therefore,  $V_{\rm A}$  is 0,  $V_{\rm B}$  is 1, and  $V_{\rm E}$  is 1. The output of the Alexander PD is thus 1 for Up and 0 for *Down*. As a result, the charge pump will produce a current pulse that goes into the loop filter.

Fig. 3.2(b) is a low to high transition when the clock is early. Therefore,  $V_{\rm A}$  is 0,  $V_{\rm B}$  is 1, and  $V_{\rm E}$  is 0. The output of the Alexander PD is thus 0 for Up and 1 for *Down*. As a result, the charge pump will produce a current pulse that leaves the loop filter.

Comparing Fig. 3.2(a) and Fig. 3.2(b), the distinction is whether $V_{\rm E}$ is the same as $V_{\rm A}$ or $V_{\rm B}$. However, Fig. 3.2 is an idealized NRZ waveform with sharp transitions and no distortions; real waveforms will look more like Fig. 3.3.

In Fig. 3.3, $V_{\rm A}$ and $V_{\rm B}$ are still sampled at the peak but $V_{\rm E}$ is sampled during the transition. If we define $V_{\rm A}$ here as 0 and $V_{\rm B}$ here as 1, then the value of $V_{\rm E}$ is between 0 and 1. By observation, when the clock is late, $V_{\rm E}$ is closer to $V_{\rm B}$ than to $V_{\rm A}$. When the clock is early, $V_{\rm E}$ is closer to $V_{\rm A}$ than to $V_{\rm B}$.

Therefore, the intuition is, if we calculate the Euclidean distance between the edge sample and its previous data sample as well as the Euclidean distance between the same edge sample and its subsequent data sample, could we determine if the clock is early or late by comparing these two distances? Furthermore, could this intuition be applied to PAM4?

Fig. 3.4 shows a low-to-high transition in a PAM4 signal.  $V_{\rm A}$  and  $V_{\rm B}$  are still data



Figure 3.3: Alexander PD detects (a) clock late, and (b) clock early with more realistic waveforms.



Figure 3.4: Measuring Euclidean distances for PAM4 when (a) the clock is late and (b) the clock is early.



Figure 3.5:  $V_{\rm A} + V_{\rm B} - 2V_{\rm E}$  for all PAM4 transitions.

samples and $V_{\rm E}$ is still the edge sample. The Euclidean distance between the edge $V_{\rm E}$ and its previous data sample $V_{\rm A}$ is $V_{\rm E} - V_{\rm A}$, and the Euclidean distance between $V_{\rm E}$ and its subsequent data sample $V_{\rm B}$ is $V_{\rm B} - V_{\rm E}$. The difference of these two distances is $(V_{\rm B} - V_{\rm E}) - (V_{\rm E} - V_{\rm A}) = V_{\rm A} + V_{\rm B} - 2V_{\rm E}$.

In Fig. 3.4(a), the clock is late. By observation, $V_{\rm E}$ is closer to $V_{\rm B}$ than to $V_{\rm A}$; therefore, $V_{\rm A} + V_{\rm B} - 2V_{\rm E} < 0$.

In Fig. 3.4(b), the clock is early. By observation, $V_{\rm E}$ is closer to $V_{\rm A}$ than to $V_{\rm B}$; therefore, $V_{\rm A} + V_{\rm B} - 2V_{\rm E} > 0$.

In summary, for the low-to-high transition of Fig. 3.4, $V_{\rm A} + V_{\rm B} - 2 V_{\rm E} < 0$ when the clock is late, and $V_{\rm A} + V_{\rm B} - 2 V_{\rm E} > 0$ when the clock is early.

Phase detection based on $V_{\rm A} + V_{\rm B} - 2V_{\rm E}$ applies to all types of PAM4 transitions, as shown in Fig. 3.5. Whether it is a major transition (Fig. 3.5(a)), a middle transition (Fig. 3.5(b)), or a minor transition (Fig. 3.5(c)), $V_{\rm A} + V_{\rm B} - 2V_{\rm E} = 0$ always occurs right at the midpoint of the transition. When the clock is late, $V_{\rm A} + V_{\rm B} - 2V_{\rm E}$ is negative for all highlighted transitions. When the clock is early, it is positive for all highlighted transitions.
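As a minimal numerical sketch of this claim, the Python snippet below models each low-to-high PAM4 transition as a straight-line ramp between two levels and evaluates $V_{\rm A} + V_{\rm B} - 2V_{\rm E}$ for an edge sample taken at the midpoint, slightly late, and slightly early. The linear ramp is an idealization (real waveforms are band-limited), and the names `edge_metric` and `tau` as well as the normalized levels are illustrative assumptions rather than part of the design.

```python
# Illustrative model only: a transition is a linear ramp from v_a to v_b.
# tau is the edge-sampling offset as a fraction of the ramp duration:
# tau = 0 is the midpoint, tau > 0 means a late clock, tau < 0 an early clock.

PAM4_LEVELS = (-3, -1, 1, 3)  # nominal levels, arbitrary units


def edge_metric(v_a: float, v_b: float, tau: float) -> float:
    """Return V_A + V_B - 2*V_E for a linear ramp sampled at offset tau."""
    v_e = v_a + (v_b - v_a) * (0.5 + tau)
    return v_a + v_b - 2.0 * v_e


# All low-to-high transitions: minor (one level), middle (two), major (three).
rising = [(a, b) for a in PAM4_LEVELS for b in PAM4_LEVELS if b > a]

for v_a, v_b in rising:
    at_mid = edge_metric(v_a, v_b, 0.0)
    late = edge_metric(v_a, v_b, +0.1)
    early = edge_metric(v_a, v_b, -0.1)
    print(f"{v_a:+d} -> {v_b:+d}: mid={at_mid:+.2f} late={late:+.2f} early={early:+.2f}")
```

Under this model the metric reduces to $-2\tau (V_{\rm B} - V_{\rm A})$, so its zero crossing sits at the midpoint regardless of the transition size, while its magnitude scales with the size of the transition.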

Having established that  $V_{\rm A} + V_{\rm B} - 2V_{\rm E}$  can be the foundation of a phase detector, the next challenge is how to realize  $V_{\rm A} + V_{\rm B} - 2V_{\rm E}$ .

#### 3.2.2 Timing



Figure 3.6: Timing challenge for  $V_{\rm A} + V_{\rm B} - 2V_{\rm E}$ .

For a 56-Gb/s PAM4 data input, the symbol period is roughly 35.72 ps. Therefore, the time difference between  $V_{\rm A}$  and  $V_{\rm E}$  or between  $V_{\rm E}$  and  $V_{\rm B}$  is 17.86 ps as shown in Fig. 3.6. To realize  $V_{\rm A} + V_{\rm B} - 2V_{\rm E}$ , we need to align them so that all three can be present in a circuit at the same time.

In Fig. 2.5, this alignment is done by the flip-flops. The bottom two flip-flops retime  $V_{\rm E}$  to align it with  $V_{\rm B}$ , and the top two flip-flops retime  $V_{\rm A}$  to align it with  $V_{\rm B}$ .

Retiming  $V_A$ ,  $V_E$ , and  $V_B$  with flip-flops, however, is equivalent to passing them through a comparator, which will remove the amplitude information of  $V_A$ ,  $V_E$ , and  $V_B$ . This removal will render the subsequent operation of  $V_A + V_B - 2V_E$  useless since  $V_A + V_B - 2V_E$  requires both polarity and amplitude information of these samples.

Following on from this retiming method, the next attempt would be to retime $V_{\rm A}$ and $V_{\rm E}$ while maintaining their linearity. However, this task is rather difficult. Even the charge-steering logic in [31] introduces significant non-linearity. On the other hand, if linearity is perfectly maintained, then the offsets of the retiming circuits will translate to the output as well, hence introducing significant phase error.

Figure 3.7: One-eighth rate for $V_{\rm A} + V_{\rm B} - 2V_{\rm E}$.

As a result, it is better to realize alignment without using retiming. The proposed method therefore is to use three 3.5-GHz (one-eighth rate) clocks to drive three sample-and-hold (S/H) switches to sample $V_{\rm A}$, $V_{\rm E}$, and $V_{\rm B}$ respectively. These three clocks are 17.86 ps or 22.5° apart, as shown in Fig. 3.7.

In Fig. 3.7, the S/H switch controlled by $CK_j$ tracks the PAM4 input when $CK_j$ is low. At $t_0$, the rising edge of $CK_j$, the voltage $V_A$ is held at the output of this S/H switch. Similarly, at $t_1$, the S/H switch of $CK_{j+1}$ holds $V_E$, and at $t_2$, the S/H switch of $CK_{j+2}$ holds $V_B$. At $t_3$, the $CK_j$ S/H switch goes back to track mode and its output no longer holds $V_A$.

All clocks here are 50% duty-cycle 3.5-GHz signals, therefore the hold time is about 142.8 ps.  $V_{\rm B}$  is the last voltage to be available  $(t_2)$  since it is the second data point sample, and  $V_{\rm A}$  is the first voltage to disappear  $(t_3)$  since it is the first data point sample. The timing window when  $V_{\rm A}$ ,  $V_{\rm E}$ , and  $V_{\rm B}$  are all available is between  $t_2$  and  $t_3$ , which is about 107 ps.

By using one-eighth rate and S/H switches, $V_A$, $V_E$, and $V_B$ can coexist in the circuit for 107 ps. Such a unit needs to be repeated eight times, as shown in Fig. 3.1, since the design is one-eighth rate. The first unit uses $CK_0$, $CK_1$, and $CK_2$. The second unit uses $CK_2$, $CK_3$, and $CK_4$, and so on. The eighth unit uses $CK_{14}$, $CK_{15}$, and $CK_0$. $CK_0$, $CK_1$, ..., and $CK_{15}$ are all 3.5 GHz and 22.5° apart.
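The clock assignment just described follows a simple pattern, which the sketch below writes out. It assumes that unit $k$ (counting from zero) uses phases $2k$, $2k+1$, and $2k+2$, with the index wrapping modulo 16, as the first, second, and eighth units in the text suggest; the function name `unit_clocks` is hypothetical.

```python
NUM_PHASES = 16  # CK_0 ... CK_15, spaced 22.5 degrees apart
NUM_UNITS = 8    # one-eighth rate, so eight phase-detector units


def unit_clocks(unit: int) -> tuple[int, int, int]:
    """Clock-phase indices used for (V_A, V_E, V_B) by a unit, unit = 0..7."""
    base = 2 * unit
    return (base % NUM_PHASES, (base + 1) % NUM_PHASES, (base + 2) % NUM_PHASES)


for k in range(NUM_UNITS):
    a, e, b = unit_clocks(k)
    print(f"unit {k + 1}: V_A <- CK_{a}, V_E <- CK_{e}, V_B <- CK_{b}")
```

With this mapping, the even-indexed phases always sample data points and the odd-indexed phases always sample edges, and the $V_{\rm B}$ phase of one unit is the $V_{\rm A}$ phase of the next.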

Figure 3.8: Quarter rate for $V_{\rm A} + V_{\rm B} - 2V_{\rm E}$ as a comparison.

The choice of one-eighth rate is deliberate. With quarter-rate clocking, as shown in Fig. 3.8, the timing window when $V_A$, $V_E$, and $V_B$ are all available shrinks to 35.72 ps. Therefore, the bandwidth requirement for a quarter-rate design is three times that of the one-eighth-rate design, which usually means a threefold increase in power for each unit. Although the quarter-rate design reduces the number of units from eight to four, its overall power consumption is 1.5 times that of the one-eighth-rate design.
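The timing-window and power arithmetic above can be reproduced with a short calculation. The sketch below assumes 50% duty-cycle clocks, data samples one symbol period apart, and, as a first-order rule taken from the text, power per unit proportional to the required bandwidth; the helper name `timing_window_ps` is illustrative.

```python
BAUD_RATE = 28e9  # 56-Gb/s PAM4 corresponds to 28 GBaud


def timing_window_ps(rate_divisor: int) -> float:
    """Window (ps) during which V_A, V_E, and V_B are all held.

    Assumes 50% duty-cycle clocks at BAUD_RATE / rate_divisor, with V_B
    sampled one symbol period after V_A.
    """
    ui_ps = 1e12 / BAUD_RATE               # symbol period
    hold_ps = rate_divisor * ui_ps / 2.0   # half of the clock period
    return hold_ps - ui_ps                 # overlap between V_A hold and V_B hold


eighth = timing_window_ps(8)
quarter = timing_window_ps(4)
bandwidth_ratio = eighth / quarter         # quarter rate needs about 3x the bandwidth
# First-order assumption from the text: power per unit scales with bandwidth.
power_ratio = (4 * bandwidth_ratio) / (8 * 1.0)

print(f"one-eighth rate window: {eighth:.1f} ps, quarter rate window: {quarter:.1f} ps")
print(f"quarter-rate power relative to one-eighth rate: {power_ratio:.2f}")
```

The same function can be called with a divisor of 16 to see the larger window of a one-sixteenth-rate design, although the calculation does not capture the input-bandwidth penalty of the longer input trace.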

On the other hand, choosing one-sixteenth rate would further increase the timing window. However, it would also double the number of units needed and double the length of the trace for the input signal. This lowers the input bandwidth to the point that the resulting eye diagram at the input of the S/H switches is significantly degraded.

To sum it up, choosing one-eighth rate gives ample time for  $V_{\rm A} + V_{\rm B} - 2 V_{\rm E}$  to be generated and ensures that the S/H switches receive an eye diagram with enough opening.

#### 3.2.3 The Circuit to Generate the Euclidean Distance

In generating $V_{\rm A} + V_{\rm B} - 2V_{\rm E}$, linearity is of the utmost importance: DC offsets can be corrected later on, but once non-linearity is introduced, its effect on the final result is very difficult to reverse. Therefore, the circuit in Fig. 3.9 is designed to generate $V_{\rm A} + V_{\rm B} - 2V_{\rm E}$.

To ensure linearity, source degeneration is used and  $R_{\rm S}$  is 20 k $\Omega$ .

Figure 3.9: The circuit for $V_{\rm A} + V_{\rm B} - 2V_{\rm E}$.

To reduce power consumption while maintaining a certain circuit bandwidth, it is preferable to increase $R_{\rm D}$ and to reduce the sizes of $M_1$–$M_{10}$. However, the widths of $M_1$–$M_{10}$ should not decrease indefinitely. [20] shows

$$\Delta V_{\rm TH} = \frac{A_{\rm VTH}}{\sqrt{WL}},\tag{3.1}$$

$$\Delta(\mu C_{\rm ox} \frac{W}{L}) = \frac{A_{\rm K}}{\sqrt{WL}},\tag{3.2}$$

where both $\Delta V_{\text{TH}}$ and $\Delta(\mu C_{\text{ox}} W/L)$ contribute to DC offsets and increase as the area of a transistor ($WL$) decreases. Even though the proposed offset-cancellation scheme (explained later) can significantly reduce DC offsets, $M_1$–$M_{10}$ have a $W$ of 500 nm and an $L$ of 30 nm to keep the residual effective offsets below a certain level.

Based on the sizes of  $M_1 - M_{10}$  and the load of the subsequent stage,  $R_D$  is 8 k $\Omega$ .

The $V_{\rm cal}$ differential pair is for background offset cancellation.

Fig. 3.10 shows simulated waveforms of Fig. 3.9. In this simulation, $V_{\text{cal}}$ and $V_{\text{bcal}}$ are grounded. The first plot shows differential waveforms of $V_A$, $V_E$, and $V_B$. We can clearly observe that $V_A$, $V_E$, and $V_B$ go into the holding mode one after another and then go back into the sampling mode one after another as well.

Figure 3.10: Simulated waveforms for the $V_{\rm A} + V_{\rm B} - 2V_{\rm E}$ circuit.

The second plot of Fig. 3.10 shows the differential waveform of $V_{\text{out}}$ in Fig. 3.9. The desired $V_{\text{A}} + V_{\text{B}} - 2V_{\text{E}}$ is available a short time after $V_{\text{B}}$ is ready. When all three switches are in the sampling mode, $V_{\text{out}}$ is zero since $V_{\text{A}} + V_{\text{B}} - 2V_{\text{E}} = 0$ if all three follow the input signal. If the three switches are in different modes, $V_{\text{out}}$ fluctuates as the input PAM4 signal changes.

#### 3.2.4 The Need for Two Euclidean Distances

So far, it is established that $V_{\rm A} + V_{\rm B} - 2 V_{\rm E}$ is the foundation of the proposed phase detector, and the circuit in Fig. 3.9 is able to produce a differential signal representing $V_{\rm A} + V_{\rm B} - 2 V_{\rm E}$ in the 107-ps timing window as shown in Fig. 3.7.

However,  $V_{\rm A} + V_{\rm B} - 2V_{\rm E}$  by itself is not able to correctly determine if the clock is early or late. Fig. 3.11 shows all four scenarios depending on whether the clock is early or late and whether it is a low-to-high or high-to-low transition. By looking at Euclidean distances, we find in Fig. 3.11:

- (a) $V_{\rm A} + V_{\rm B} - 2V_{\rm E} < 0$,
- (b) $V_{\rm A} + V_{\rm B} - 2V_{\rm E} > 0$,
- (c) $V_{\rm A} + V_{\rm B} - 2V_{\rm E} > 0$,
- (d) $V_{\rm A} + V_{\rm B} - 2V_{\rm E} < 0.$

Figure 3.11: Low-to-high transitions and high-to-low transitions in PAM4.

When $V_{\rm A} + V_{\rm B} - 2V_{\rm E} < 0$, it could be (a) late or (d) early; and when $V_{\rm A} + V_{\rm B} - 2V_{\rm E} > 0$, it could be (b) late or (c) early. Therefore, $V_{\rm A} + V_{\rm B} - 2V_{\rm E}$ by itself introduces ambiguity in the phase decision. A late clock and an early clock can lead to the same decision because $V_{\rm A} + V_{\rm B} - 2V_{\rm E}$ does not contain the sign of the transition. To fix this issue, the sign information needs to be produced as well.

Since  $V_{\rm A} - V_{\rm B}$  represents the sign of the transition, the complete mathematical equation for the proposed phase detector is:

$$Error = (V_{\rm A} + V_{\rm B} - 2V_{\rm E}) \times (V_{\rm A} - V_{\rm B}). \tag{3.3}$$

Figure 3.12: The circuit for $V_{\rm A} - V_{\rm B}$.

Equation (3.3) is positive when the clock is late and negative when the clock is early. It detects all transitions, as shown in Fig. 3.5. Note that Equation (3.3) represents the logic of the proposed phase detector; it does not mean that the two signals ($V_{\rm A} + V_{\rm B} - 2V_{\rm E}$ and $V_{\rm A} - V_{\rm B}$) are literally multiplied together.
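To see how the sign term resolves the ambiguity, the following sketch evaluates Equation (3.3) for the four scenarios of Fig. 3.11. It uses an idealized linear-ramp transition between two example PAM4 levels; the helper names, the levels, and the timing offset `tau` are assumptions made for illustration.

```python
def phase_error(v_a: float, v_e: float, v_b: float) -> float:
    """Equation (3.3): (V_A + V_B - 2*V_E) * (V_A - V_B)."""
    return (v_a + v_b - 2.0 * v_e) * (v_a - v_b)


def edge_sample(v_a: float, v_b: float, tau: float) -> float:
    """Edge sample on an idealized linear ramp; tau > 0 means a late clock."""
    return v_a + (v_b - v_a) * (0.5 + tau)


# (V_A, V_B, tau) for each scenario; levels are in normalized PAM4 units.
cases = {
    "low-to-high, late": (-1.0, 3.0, +0.1),
    "high-to-low, late": (3.0, -1.0, +0.1),
    "low-to-high, early": (-1.0, 3.0, -0.1),
    "high-to-low, early": (3.0, -1.0, -0.1),
}

for name, (v_a, v_b, tau) in cases.items():
    err = phase_error(v_a, edge_sample(v_a, v_b, tau), v_b)
    print(f"{name:20s} error = {err:+.2f}")
```

In this model the product equals $2\tau (V_{\rm B} - V_{\rm A})^2$, so its sign depends only on whether the clock is late or early and not on the direction of the transition.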

 $V_{\rm A} - V_{\rm B}$  can be generated in a similar way to  $V_{\rm A} + V_{\rm B} - 2V_{\rm E}$  as shown in Fig. 3.12.

Fig. 3.13 shows the simulated waveforms for Fig. 3.12.

The two $V_{\text{out}}$ signals from Fig. 3.10 and Fig. 3.13 are not yet ready to drive charge pumps because of three issues. First, these two signals are only meaningful for 107 ps out of every 285.7 ps (the period of a 3.5-GHz clock); outside this window, the product of the two signals is unpredictable. Fig. 3.14 shows waveforms of $V_{\rm A} + V_{\rm B} - 2V_{\rm E}$ and $V_{\rm A} - V_{\rm B}$ when the clock is late, together with the result of their multiplication. Ideally, *Error* should always be positive since the clock is late. However, glitches are present and there are also negative-voltage pulses. These reduce the phase detector gain and increase the phase noise.

Figure 3.13: Simulated waveforms for the $V_{\rm A} - V_{\rm B}$ circuit.

Figure 3.14: Multiplication of $V_{\rm A} + V_{\rm B} - 2V_{\rm E}$ and $V_{\rm A} - V_{\rm B}$ when the clock is late.

Second, as shown in Fig. 3.13, the amplitudes of $V_{\rm A} + V_{\rm B} - 2V_{\rm E}$ and $V_{\rm A} - V_{\rm B}$ are too small to drive logic gates, which usually require rail-to-rail inputs. If they are used to drive a current-mode differential multiplier circuit, the offsets of this multiplier must be very small. Furthermore, offsets of subsequent circuits also translate back to this stage. To limit all these offsets, large transistor sizes are required, which will increase power consumption and/or reduce circuit bandwidth.

Third, the DC offsets in Fig. 3.9 and Fig. 3.12 will be left untreated if $V_{\rm A} + V_{\rm B} - 2V_{\rm E}$ and $V_{\rm A} - V_{\rm B}$ are directly multiplied together. The effect of the DC offsets is as follows: instead of $V_{\rm A} + V_{\rm B} - 2V_{\rm E}$ and $V_{\rm A} - V_{\rm B}$, the outputs of the circuits are essentially $V_{\rm A} + V_{\rm B} - 2V_{\rm E} + \Delta_1$ and $V_{\rm A} - V_{\rm B} + \Delta_2$, where $\Delta_1$ and $\Delta_2$ are the total input-referred DC offsets in Fig. 3.9 and Fig. 3.12, respectively.

#### 3.2.5 Next Step

These three issues mean that the two $V_{out}$ signals in Fig. 3.10 and Fig. 3.13 need to be processed before driving the charge pump circuits. To fix the first issue (the unwanted time period of $V_{out}$), $V_{out}$ can be retimed during the 107-ps window. To fix the second issue (the small voltage swing of $V_{out}$), the retiming can be done by a comparator.

Implementing this change, equation (3.3) effectively becomes:

$$Error = sign(V_{\rm A} + V_{\rm B} - 2V_{\rm E}) \times sign(V_{\rm A} - V_{\rm B}), \qquad (3.4)$$

where  $sign(\cdot)$  indicates the sign of a number.

Equation (3.4) shows that the proposed PD is a bang-bang PD.
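The sketch below illustrates Equation (3.4) together with the third issue noted earlier: an uncorrected input-referred offset on the $V_{\rm A} + V_{\rm B} - 2V_{\rm E}$ branch can flip the decision when the phase error is small. The transition amplitude, timing offset, and offset value are hypothetical numbers chosen only to make the effect visible, and the transition is again modeled as a linear ramp.

```python
def sign(x: float) -> int:
    """Sign of a number as +1 or -1 (zero is treated as +1 in this sketch)."""
    return 1 if x >= 0 else -1


def bang_bang_error(v_a, v_e, v_b, offset_1=0.0, offset_2=0.0) -> int:
    """Equation (3.4), with optional input-referred offsets on each branch."""
    difference = v_a + v_b - 2.0 * v_e + offset_1
    direction = v_a - v_b + offset_2
    return sign(difference) * sign(direction)


# Hypothetical 0.3-V low-to-high transition, clock late by 5% of the ramp.
v_a, v_b, tau = -0.15, 0.15, 0.05
v_e = v_a + (v_b - v_a) * (0.5 + tau)

ideal = bang_bang_error(v_a, v_e, v_b)
with_offset = bang_bang_error(v_a, v_e, v_b, offset_1=0.1)
print("ideal decision:", ideal)
print("decision with a 100-mV offset:", with_offset)
```

Here the offset is larger than the magnitude of $V_{\rm A} + V_{\rm B} - 2V_{\rm E}$, so the late clock is reported as early, which is why offset cancellation matters most near zero phase error.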

To fix the third issue, an offset-cancellation scheme is required. Combining the above three solutions, a tentative/conventional approach is shown in Fig. 3.15.

In Fig. 3.15, $G_{m1}$ represents one source-degenerated differential cell in Fig. 3.9. To avoid introducing additional offset, the offset-sensing circuit (in red) must have a low offset itself. Therefore, it will present a large load at node X, reducing the bandwidth significantly. In addition, the offset of the comparator is left untreated. Therefore, a new approach is proposed.



Figure 3.15: Conventional offset cancellation approach.



Figure 3.16: Concepts of the proposed offset-cancellation comparator.

### 3.3 Offset Cancellation

#### 3.3.1 Concepts

Fig. 3.16 shows the concepts of the proposed offset-cancellation scheme. As in Fig. 3.15, $G_{m1}$ represents a source-degenerated cell in Fig. 3.9. $S_1$ follows $V_{out}$ of the circuit for $V_A + V_B - 2V_E$. It samples $V_A + V_B - 2V_E$ at the correct moment (when $V_A$, $V_E$, and $V_B$ are all ready and the corresponding $V_{out}$ is nearly settled) and generates a nearly rail-to-rail differential output based on the sign of $V_A + V_B - 2V_E$. The DC offset $\Delta_1$ is stored at $C_3$. Offset cancellation is achieved by amplifying $\Delta_1$ and sending a current signal back to node X via $G_{m2}$, thus forming a negative feedback loop for the DC offset.

The operations of this scheme can be logically divided into three phases.



Figure 3.17: Phase 1 of the proposed offset-cancellation scheme.

In phase 1 (Fig. 3.17),  $S_1$  and  $S_5$  are on;  $S_3$  and  $S_7$  are off. Therefore, the voltage at node X will be amplified and stored at  $C_1$ .



Figure 3.18: Phase 2 of the proposed offset-cancellation scheme.

In phase 2 (Fig. 3.18),  $S_5$  is off and  $S_7$  is on. Therefore, the charge stored at  $C_1$  and  $C_3$  will be shared and the voltages of the two capacitors become the same.

Figure 3.19: Phase 3 of the proposed offset-cancellation scheme.

$C_1$ is realized with transistor parasitics (at the level of 1 fF) and $C_3$ is a large fringe capacitor (at the level of 1 pF). Therefore, over the long run, the voltage at $C_3$ is an average of the voltage at node X after being amplified by $A_1$. Another way to look at it is that $S_5 - C_1 - S_7 - C_3$ form a switched-capacitor low-pass filter. Either way, the voltage at $C_3$ will be $\Delta_1$ with some gain, since the average (or low-pass-filtered value) of $V_A$, $V_E$, and $V_B$ is zero.
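The averaging behavior of the switched-capacitor path can be illustrated with a simple charge-sharing recursion. In the sketch below, the capacitor values follow the orders of magnitude quoted in the text, while the gain, the offset, and the zero-mean input sequence are hypothetical stand-ins; the gain and offset are chosen to mirror the values quoted for the later simulation.

```python
C1 = 1e-15     # sampling capacitor, about 1 fF (transistor parasitics)
C3 = 1e-12     # storage capacitor, about 1 pF (fringe capacitor)
GAIN = 1.05    # assumed gain from the input to the voltage sampled on C1
OFFSET = 0.1   # assumed input-referred DC offset, in volts
# Hypothetical zero-mean stand-in for the sampled V_A + V_B - 2*V_E values.
SIGNAL = (0.3, -0.3, 0.1, -0.1)


def run_filter(cycles: int) -> float:
    """Voltage on C3 after repeated sampling (phase 1) and charge sharing (phase 2)."""
    v_c3 = 0.0
    for n in range(cycles):
        v_c1 = GAIN * (SIGNAL[n % len(SIGNAL)] + OFFSET)  # phase 1: C1 tracks
        v_c3 = (C1 * v_c1 + C3 * v_c3) / (C1 + C3)        # phase 2: share charge
    return v_c3


v_final = run_filter(20_000)
print(f"V_C3 after 20000 cycles: {v_final * 1e3:.1f} mV")
print(f"gain x offset:           {GAIN * OFFSET * 1e3:.1f} mV")
```

Because each charge-sharing step moves $C_3$ by only about $C_1/(C_1 + C_3) \approx 0.1\%$ of the difference, the zero-mean signal content averages out and the stored voltage converges toward the amplified offset over a few thousand clock cycles.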

In phase 3 (Fig. 3.19),  $S_1$  and  $S_5$  are off;  $S_3$  is on. A positive feedback is formed for  $A_1$ , thus a nearly rail-to-rail differential signal can be generated based on the sign of the voltage at node X.

#### 3.3.2 Proposed Comparator Circuit

Fig. 3.20 shows the proposed comparator, which is essential to the offset-cancellation scheme. $V_{\rm in}$ follows $V_{\rm out}$ in Fig. 3.9 or Fig. 3.12. $V_{\rm out}$ is a nearly rail-to-rail signal for half of the 3.5-GHz clock period. $V_{\rm os}$ carries the extracted DC-offset voltage.

The $M_1$ and $M_2$ differential pair is equivalent to $A_1$ in Fig. 3.16. The differential path formed by $S_5 - C_1 - S_7 - C_3$ and $S_6 - C_2 - S_8 - C_4$ is equivalent to $S_5 - C_1 - S_7 - C_3$ in Fig. 3.16. $G_{m2}$ in Fig. 3.16 will be explained later.

In Fig. 3.20, all switches are controlled by 3.5-GHz clock signals. $S_1$ and $S_2$ are controlled by one 50% duty-cycle clock. $S_3$, $S_4$, and $S_T$ are controlled by its complement. Ignoring the other circuit components for now, these switches decide the configuration of $M_1$ and $M_2$.

When $S_1$ and $S_2$ are on and $S_3$, $S_4$, and $S_T$ are off as shown in Fig. 3.21(a), $M_1$, $M_2$, the bias transistor $M_3$, and two resistors form a differential amplifier that tracks and amplifies $V_{in}$. At this moment, if $V_{in} > 0$, then $V_{out} > 0$, and vice versa.



Figure 3.20: Proposed comparator.



Figure 3.21: Two configurations: (a) differential configuration, and (b) regenerative configuration.



Figure 3.22: Timing of  $S_1$ ,  $S_2$ ,  $S_3$ ,  $S_4$ , and  $S_T$ .

When  $S_1$  and  $S_2$  are off and  $S_3$ ,  $S_4$ , and  $S_T$  are on as shown in Fig. 3.21(b),  $M_1$  and  $M_2$  form a cross-coupled pair with their drains connected to the tail  $S_T$ . In this configuration,  $M_1$  and  $M_2$  form a regenerative pair. If  $V_{out}$  is initially positive, then  $V_{out}$  will be amplified to a positive rail-to-rail differential signal. If  $V_{out}$  is initially negative, then  $V_{out}$  will be amplified to a negative rail-to-rail differential signal.

Fig. 3.22 shows the timing of $S_1$, $S_2$, $S_3$, $S_4$, and $S_T$. $V_A + V_B - 2V_E$ becomes available when $CK_{j+2}$ goes into holding mode, which is $t_0$. The phase of the clock for $S_1$ and $S_2$ is 112.5° after $CK_{j+2}$. As a result, the proposed comparator stays in the differential configuration until $t_1$. The proposed comparator differentially amplifies $V_A + V_B - 2V_E$ between $t_0$ and $t_1$, which is 89 ps.

At  $t_1$ , the proposed comparator goes into the regenerative configuration. If  $V_A + V_B - 2V_E > 0$ , a positive rail-to-rail output will be generated. If  $V_A + V_B - 2V_E < 0$ , a negative rail-to-rail output will be generated.

Figure 3.23: Timing of $S_5$, $S_6$, $S_7$, $S_8$, $S_9$, and $S_{10}$.

At $t_2$, the proposed comparator goes back to the differential configuration and waits for the next $V_{\rm A} + V_{\rm B} - 2 V_{\rm E}$ to be available.

Fig. 3.23 shows the timing of the rest of the switches. $S_5$–$S_{10}$ all have a 25% duty cycle. When $S_5$ and $S_6$ are on, $C_1$ and $C_2$ track the differential voltage $V_{out}$. Since this happens during the proposed comparator's differential configuration, the differential voltage at $C_1$ and $C_2$ when $S_5$ and $S_6$ turn off is:

$$V_{\rm C1} - V_{\rm C2} = A \times (V_{\rm A} + V_{\rm B} - 2V_{\rm E} + \Delta_1), \qquad (3.5)$$

where A is the total gain from the input of the  $V_{\rm A} + V_{\rm B} - 2V_{\rm E}$  circuit to the output of the proposed comparator, and  $\Delta_1$  is the total input-referred DC offset contributed by both Fig. 3.9 or Fig. 3.12 and Fig. 3.20.

Simulation results in Fig. 3.24 show that the proposed comparator first amplifies the input signal differentially and then generates a nearly rail-to-rail output when the supply voltage is 0.85 V. $V_{\rm C1} - V_{\rm C2}$ tracks $V_{\rm out}$ during the proposed comparator's differential configuration. $C_3$ and $C_4$ are set to have an initial differential voltage of 0, which is why $V_{\rm C1} - V_{\rm C2}$ goes back to zero when $S_7$ and $S_8$ are on.

Figure 3.24: Simulated waveforms of the proposed comparator.

When $S_7$ and $S_8$ are on, $C_1$ and $C_3$ ($C_2$ and $C_4$) are connected together. From $V_{out}$ to $V_{os}$, the signal path is equivalent to a low-pass filter or an averaging process, as mentioned before. Therefore, at the end of this signal path:

$$V_{\rm os} = A \times \Delta_1. \tag{3.6}$$

Equation (3.6) shows that the DC offset is extracted.

$S_9$ and $S_{10}$ reset the gate and drain voltages of $M_1$ and $M_2$ during the transition from the regenerative configuration to the differential configuration. At the end of the regenerative configuration, both the gate voltages and the drain voltages of $M_1$ and $M_2$ are nearly rail-to-rail signals. However, for the differential configuration, these voltages should have a common-mode level determined by the bias and a differential-mode swing significantly smaller than rail-to-rail. By resetting them, the settling time is greatly reduced.

Figure 3.25: Simulated waveforms of $V_{\rm os}$ in the presence of offsets.

Fig. 3.25 shows the waveform of $V_{os}$ in the presence of DC offsets. DC voltage sources are inserted after the S/H switches such that $\Delta_1$ in Equation (3.6) is 100 mV. $A$ in Equation (3.6) is 1.05 based on simulation results. The initial voltages of both nodes for $V_{os}$ are set to the common-mode voltage of $V_{out}$ in the proposed comparator (Fig. 3.20) in its differential configuration. The simulation shows that $V_{os}$ settles to around 100 mV, which matches the theoretical prediction of $A \times \Delta_1 \approx 105$ mV.

#### 3.3.3 Complete Offset-Cancellation Loop

With $V_{\rm os}$ ready, the next step is to finish the offset-cancellation loop. $V_{\rm os}$ in Fig. 3.20 is amplified by amplifier $A_{\rm cal}$. Large device sizes and careful layout techniques are applied to reduce the offset of $A_{\rm cal}$. The output of $A_{\rm cal}$ goes to $V_{\rm cal}$ in Fig. 3.9 or Fig. 3.12. Therefore, a negative feedback loop is formed for the DC offset, and the effective total input-referred offset is reduced. $A_{\rm cal}$ plus the differential pair of the $V_{\rm cal}$ input in Fig. 3.9 are equivalent to $G_{\rm m2}$ in Fig. 3.16.

To quantify the background offset-cancellation loop in Fig. 3.26 (a) or (b), a mathematical model can be set up as in Fig. 3.27.

Figure 3.26: Background offset-cancellation loops: (a) $V_{\rm A} + V_{\rm B} - 2V_{\rm E}$, and (b) $V_{\rm A} - V_{\rm B}$.

Figure 3.27: A mathematical model for the background offset-cancellation loop.

Figure 3.28: The effect of the residual input-referred DC offset, $\delta$.

Figure 3.29: The total input-referred DC offset reduces to $\delta$ over time in the closed loop.

In Fig. 3.27, $\Delta_1$ and $A_1$ are the equivalent model of the source-degenerated differential cells in terms of offsets and gain in Fig. 3.9 or Fig. 3.12, whereas $\Delta_2$ and $A_2$ represent the differential pair of the $V_{\text{cal}}$ input. $\Delta_3$ and $A_3$ represent the proposed comparator. $\Delta_{\text{Acal}}$ and $A_{\text{cal}}$ represent the $A_{\text{cal}}$ amplifier in Fig. 3.26.

Without this feedback loop ( $A_2$  and  $A_{cal}$ ), the total input-referred DC offset is  $\Delta_1 + \Delta_3/A_1$ . With this loop, it will be effectively reduced to  $\delta$  demonstrated in Fig. 3.28.

We can calculate this  $\delta$  using the model in Fig. 3.27, which is:

$$\delta^2 \approx \left(\frac{\Delta_1}{A_2 \times A_3 \times A_{\text{cal}}}\right)^2 + \left(\frac{\Delta_2}{A_1 \times A_3 \times A_{\text{cal}}}\right)^2 + \left(\frac{\Delta_3}{A_1 \times A_2 \times A_3 \times A_{\text{cal}}}\right)^2 + \left(\frac{\Delta_{\text{Acal}}}{A_1 \times A_3}\right)^2. \tag{3.7}$$

Figure 3.30: Differential and single-ended waveforms of $V_{out}$ of the proposed comparator.

Equation (3.7) shows that, to reduce $\delta$, the key is to increase $A_{\text{cal}}$ and to reduce $\Delta_{\text{Acal}}$. This is possible since the $A_{cal}$ amplifier can be put far from the critical path to have high gain, low offset, and a large area.

A closed-loop simulation can also be done to verify this scheme. Following the setup for Fig. 3.25, $\Delta_1$ is set to 100 mV and the remaining offsets are set to zero. $A_2$ is set to be the same as $A_1$, and $A_{cal}$ is set to 15. Fig. 3.29 shows that the total input-referred offset drops to $\delta$.
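As a rough cross-check of Equation (3.7), the sketch below evaluates the residual offset for the closed-loop setup just described. The individual gains $A_1$ and $A_3$ are not given separately in the text, so the values used here are placeholders chosen so that $A_1 \times A_3$ equals the simulated overall gain of 1.05, under the assumption that this overall gain corresponds to $A_1 \times A_3$ in the model; the function name is hypothetical.

```python
import math


def residual_offset(d1, d2, d3, d_acal, a1, a2, a3, a_cal) -> float:
    """Residual input-referred offset using the approximation of Equation (3.7)."""
    terms = (
        d1 / (a2 * a3 * a_cal),
        d2 / (a1 * a3 * a_cal),
        d3 / (a1 * a2 * a3 * a_cal),
        d_acal / (a1 * a3),
    )
    return math.sqrt(sum(t * t for t in terms))


# From the text: 100 mV on Delta_1, other offsets zero, A_2 = A_1, A_cal = 15.
# Placeholder assumption: A_1 = 1.0 and A_3 = 1.05, so that A_1 * A_3 = 1.05.
delta = residual_offset(d1=0.1, d2=0.0, d3=0.0, d_acal=0.0,
                        a1=1.0, a2=1.0, a3=1.05, a_cal=15.0)
print(f"residual input-referred offset: about {delta * 1e3:.2f} mV")
```

Setting `d_acal` to a non-zero value in the same function shows that the offset of the $A_{\rm cal}$ amplifier is not attenuated by $A_{\rm cal}$, which is consistent with the emphasis on keeping $\Delta_{\rm Acal}$ small.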

#### 3.3.4 Retime the Output of the Proposed Comparator

Since the proposed comparator is able to generate a nearly rail-to-rail differential output, as shown in Fig. 3.24, its subsequent stage can use small devices to save power, and the resulting large offset can be tolerated.

Figure 3.31: Timing of the retimer.

To realize the multiplication as in Equation (3.4), CMOS logic can be used. However, to increase robustness, the output of the proposed comparator should be retimed before driving this CMOS logic.

Fig. 3.30 plots the differential signal of $V_{out}$ in Fig. 3.20 as well as its two single-ended signals. When the proposed comparator is in the regenerative configuration, one end of $V_{out}$ goes to $V_{DD}$ (0.85 V here), and the other goes to zero. However, during the differential configuration, these two voltages are not able to turn CMOS transistors on or off in a robust manner.

To avoid this, StrongARM comparators [32] can be used to retime each $V_{out}$ and generate return-to-zero signals. The extra power consumption is small since these comparators are minimum-sized. The outputs of these two comparators are ready to drive the CMOS logic that realizes the multiplication. They correspond to the two-bit signals at the output of each $PD_j$ block in Fig. 3.1.

Figure 3.32: One data extraction unit.

The clock for the retimer ($CK_r$) is 112.5° after the clock for $S_3$, $S_4$, and $S_T$, as shown in Fig. 3.31. Therefore, from the moment the input signal crosses the edge ($V_E$ starts being held) to the moment the charge pump circuits produce the corresponding currents, the delay is 5.5 UI.
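The 5.5-UI figure can be reconstructed from the clock phases quoted in this chapter. The breakdown below is one reading of the text rather than a statement of the actual design: it assumes the delay consists of the 22.5° spacing between the $V_{\rm E}$ and $V_{\rm B}$ sampling clocks, the 112.5° before the comparator regenerates, and the 112.5° before the retimer clock, and it ignores any further delay in the logic or the charge pump.

```python
CLOCK_PERIOD_UI = 8      # a one-eighth-rate clock period spans 8 symbol periods
UI_PS = 1e12 / 28e9      # symbol period of a 28-GBaud signal, in ps


def phase_to_ui(degrees: float) -> float:
    """Convert a phase offset of the 3.5-GHz clock into unit intervals."""
    return degrees / 360.0 * CLOCK_PERIOD_UI


# Assumed delay budget (labels are descriptive, not taken from a figure).
steps_deg = {
    "V_E held -> V_B held": 22.5,
    "V_B held -> comparator regenerates": 112.5,
    "comparator regenerates -> retimer clock": 112.5,
}

total_ui = sum(phase_to_ui(deg) for deg in steps_deg.values())
for name, deg in steps_deg.items():
    ui = phase_to_ui(deg)
    print(f"{name}: {ui:.1f} UI = {ui * UI_PS:.1f} ps")
print(f"total: {total_ui:.1f} UI")
```

The same conversion reproduces the 89-ps differential-amplification window quoted earlier, since 112.5° of a 3.5-GHz clock corresponds to 2.5 UI.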

### 3.4 Data Extraction

Since  $CK_0$ ,  $CK_2$ ,  $CK_4$ ,  $CK_6$ ,  $CK_8$ ,  $CK_{10}$ ,  $CK_{12}$ , and  $CK_{14}$  are used to sample and hold data points in the phase detector units, when locked, they will be aligned to the middle of the symbol.

Therefore, each of these clocks can be used to drive an S/H switch followed by slicers to realize data extraction, as shown in Fig. 3.32. The *Data Extraction* block in Fig. 3.1 needs eight such units since the design is one-eighth rate.

A PAM4 signal has four levels; therefore, three slicers are needed to resolve the analog input, as in Fig. 2.17. $d_1$ is the output of the slicer at level 2, $d_2$ is the output of the slicer at level 0, and $d_3$ is the output of the slicer at level −2. These three thermal codes in return-to-zero form are converted to digital NRZ Gray-code form to drive a PAM4 transmitter for BER tests.
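A behavioral sketch of the three slicers and the conversion to a two-bit Gray code is given below. The slicer thresholds follow the text; the specific Gray mapping (−3 to 00, −1 to 01, +1 to 11, +3 to 10) is a commonly used one and is assumed here, since the mapping used in the actual design is not stated in this section. The function names are illustrative.

```python
THRESHOLDS = (2.0, 0.0, -2.0)  # slicer levels for d1, d2, d3


def slice_pam4(v: float) -> tuple:
    """Three-bit thermal code (d1, d2, d3) for a sample with nominal levels -3, -1, 1, 3."""
    return tuple(int(v > th) for th in THRESHOLDS)


def thermal_to_gray(d1: int, d2: int, d3: int) -> tuple:
    """Map a thermal code to a 2-bit Gray code (MSB, LSB).

    Assumed mapping: -3 -> 00, -1 -> 01, +1 -> 11, +3 -> 10.
    """
    return (d2, d1 ^ d3)


for level in (-3, -1, 1, 3):
    code = slice_pam4(level)
    print(f"level {level:+d}: thermal {code}, Gray {thermal_to_gray(*code)}")
```

With this mapping, adjacent PAM4 levels differ in exactly one Gray bit, so a slicer error between neighboring levels corrupts only one bit.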

The slicers can be made low power since they run at 3.5 GHz and their inputs come from S/H switches. One result of consuming lower power is higher offset; however, as long as this offset is within the vertical opening of the PAM4 eye, it is tolerable.

Figure 3.33: Charge pump.

### 3.5 Charge Pump

Fig. 3.33 is one charge pump unit for one pair of  $V_{\rm A} + V_{\rm B} - 2V_{\rm E}$  and  $V_{\rm A} - V_{\rm B}$  signals.

 $D_{\rm p}$  and  $D_{\rm n}$  are the differential outputs of the retimer for  $V_{\rm A} + V_{\rm B} - 2V_{\rm E}$  (*D* for difference).  $S_{\rm p}$  and  $S_{\rm n}$  are the differential outputs of the retimer for  $V_{\rm A} - V_{\rm B}$  (*S* for sign).  $\overline{D_{\rm p}}$ ,  $\overline{D_{\rm n}}$ ,  $\overline{S_{\rm p}}$ , and  $\overline{S_{\rm n}}$  are the results of their corresponding signals passing through inverters respectively. Based on Equation (3.4), the charge pump should give out a current pulse to the loop filter when *D* and *S* are of the same polarity, and it should take in a current pulse from the loop filter when *D* and *S* are of the opposite polarity.

In Fig. 3.33,  $M_n$  is off and  $M_p$  is on when D and S are of the same polarity. When D and S are of the opposite polarity,  $M_n$  is on and  $M_p$  is off.

Since both D and S are return-to-zero signals, for half of the clock period,  $D_{\rm p}$ ,  $D_{\rm n}$ ,  $S_{\rm p}$ , and  $S_{\rm n}$  are all at  $V_{\rm DD}$  level. This voltage will turn off  $M_{\rm n}$  naturally by keeping the series NMOS switches on, but it will also turn off the series PMOS switches, thus keeping  $M_{\rm p}$  on. Therefore, control signals for the  $M_{\rm p}$  switches are inverted so that during the return-to-zero phase, these PMOS switches will be kept on to turn off  $M_{\rm p}$ .

When there is no data transition, D and S are still return-to-zero signals and a current pulse will still be generated by the charge pump. Therefore, it is desirable to turn the charge pump off when there is no data transition.

For a PAM4 data signal, a transition happens when the sampled data changes from one of the four levels (3, 1, -1, and -3) to a different level as defined in Fig. 2.17. Therefore, if we know the levels of  $V_{\rm A}$  and  $V_{\rm B}$ , we will know if there is a transition.

Since the data extraction units have thermal-code representations of $V_{\rm A}$ and $V_{\rm B}$, we can simply "borrow" these digital return-to-zero signals from the slicers. The three-bit thermal-code representations for $V_{\rm A}$ are $A_1$, $A_2$, and $A_3$ (corresponding to $d_1$, $d_2$, and $d_3$), and for $V_{\rm B}$ they are $B_1$, $B_2$, and $B_3$. Extra retimers are needed for correct timing alignment.

To check if there is a transition between $V_A$ and $V_B$, three logic operations can be executed: $\overline{A_1 \oplus B_1}$, $\overline{A_2 \oplus B_2}$, and $\overline{A_3 \oplus B_3}$. If there is no data transition, all three will be 0. For a minor transition in Fig. 2.23, one out of these three operations will be 1 and the rest will be 0. For a middle transition, two out of three will be 1. For a major transition, all three will be 1. Therefore, $\overline{A_1 \oplus B_1} + \overline{A_2 \oplus B_2} + \overline{A_3 \oplus B_3}$ is 0 only if there is no data transition. Equation (3.4) is changed to:

$$Error = sign(V_{\rm A} + V_{\rm B} - 2V_{\rm E}) \times sign(V_{\rm A} - V_{\rm B}) \times (\overline{A_1 \oplus B_1} + \overline{A_2 \oplus B_2} + \overline{A_3 \oplus B_3}). \tag{3.8}$$

The XNOR of $A_x$ and $B_x$ is realized with series switches, as seen in Fig. 3.33. The OR operation in Equation (3.8) is realized by tying the outputs of these three CPs together.
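The gating described by Equation (3.8) can be sketched behaviorally as follows. The sketch follows the textual description, in which each per-bit term is 1 when the corresponding thermal-code bits of $V_{\rm A}$ and $V_{\rm B}$ differ, and models the OR term as "at least one bit position differs"; the function names and the level-to-code helper are assumptions made for illustration.

```python
def thermal_code(level: int) -> tuple:
    """Thermal code (x1, x2, x3) for a PAM4 level in {-3, -1, 1, 3}."""
    return (int(level > 2), int(level > 0), int(level > -2))


def differing_bits(level_a: int, level_b: int) -> int:
    """Number of thermal-code bit positions in which V_A and V_B differ."""
    return sum(a != b for a, b in zip(thermal_code(level_a), thermal_code(level_b)))


def gated_error(sign_d: int, sign_s: int, level_a: int, level_b: int) -> int:
    """Equation (3.8), with the OR term modeled as 'any bit position differs'."""
    transition = 1 if differing_bits(level_a, level_b) > 0 else 0
    return sign_d * sign_s * transition


# Minor, middle, and major transitions starting from the lowest level.
print(differing_bits(-3, -1), differing_bits(-3, 1), differing_bits(-3, 3))
# A late clock on a low-to-high transition, then a case with no transition.
print(gated_error(-1, -1, -1, 3), gated_error(+1, -1, 1, 1))
```

When the two data samples are at the same level, the gating term is zero and the charge pump output is suppressed regardless of the comparator decisions.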

### 3.6 Overall Phase Detector and Charge Pump Unit

Fig. 3.34 is a detailed diagram of a PD unit with its associated CP unit to show the components covered previously.

Figure 3.34: An overview of the phase detector and its associated charge pump unit.

Fig. 3.35 shows the simulated average charge pump output versus the phase error. A positive phase error means the clock is late, and therefore the output current is positive. A negative phase error means the clock is early, and therefore the output current is negative. The bang-bang characteristic is evident in Fig. 3.35 as well.

To demonstrate the impact of the offsets in the phase detector signal path and the effectiveness of the background offset-cancellation scheme, two simulations can be done based on the scenarios of Fig. 3.25 and Fig. 3.29.

In these two simulations, the setup is the same as in the simulation for Fig. 3.35. The difference is that DC voltage sources are inserted to emulate a total input-referred DC offset of 100 mV for the $V_{\rm A} + V_{\rm B} - 2V_{\rm E}$ branch.

Figure 3.35: Simulated average output current versus phase error.

Figure 3.36: PD characteristics with and without offset-cancellation.

Fig. 3.36 shows the results of these two simulations. In the case of "No Offset Cancellation", the offset-cancellation feedback loop is open as in Fig. 3.25. Simulation results indicate that a 100 mV input-referred DC offset will render the PD ineffective around zero phase error, which will increase BER and phase noise significantly and reduce jitter tolerance greatly.

For the case of "Offset Cancellation", the "Average Output Current" is calculated when the offset-cancellation feedback loop settles to steady state, since Fig. 3.29 indicates that it takes a while for the loop to reach its steady state. The case of "Offset Cancellation" resembles the ideal case in Fig. 3.35.

The results of Fig. 3.36 show the effectiveness and the necessity of the proposed offset-cancellation scheme.
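A first-order sketch can suggest why an offset in the  $V_{\rm A} + V_{\rm B} - 2V_{\rm E}$  branch disables the PD near zero phase error. The model below is an illustration under stated assumptions, not the transistor-level simulation behind Fig. 3.36: data edges are taken to be linear over one UI, the PAM4 levels are normalized to ±1 and ±3, the offset is expressed in the same normalized units, all twelve transitions are equally likely, and the transition-gating weight is ignored. The function name `avg_pd_output` is hypothetical.

```python
from itertools import permutations

LEVELS = (-3.0, -1.0, 1.0, 3.0)  # assumed normalized PAM4 levels


def sign(x):
    """Return -1, 0, or +1 according to the sign of x."""
    return (x > 0) - (x < 0)


def avg_pd_output(phase_error, offset=0.0, ui=1.0):
    """Average of sign(V_A + V_B - 2*V_E + offset) * sign(V_A - V_B)
    over all 12 PAM4 transitions, assuming a linear edge.

    phase_error > 0 means the clock samples late, so the edge sample
    V_E lies closer to V_B than to V_A.
    """
    pairs = list(permutations(LEVELS, 2))
    total = 0
    for v_a, v_b in pairs:
        v_e = (v_a + v_b) / 2 + (v_b - v_a) * phase_error / ui
        total += sign(v_a + v_b - 2 * v_e + offset) * sign(v_a - v_b)
    return total / len(pairs)


for pe in (-0.1, -0.01, 0.01, 0.1):
    print(pe, avg_pd_output(pe), avg_pd_output(pe, offset=0.5))
```

With zero offset the average output follows the sign of the phase error. With a nonzero offset, small phase errors leave the sign of the first factor fixed by the offset, so rising and falling transitions cancel and the average output collapses to zero: a dead zone around lock in this simplified model.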

Offsets are also present in the data extraction units or slicers, so it is of interest to simulate how slicer offsets affect the performance of the PD. In these simulations, all the slicer levels are increased or decreased by a certain amount, and the resulting PD characteristic is simulated. Since there are 24 slicers (3 slicers in each of the 8 data extraction units) and each slicer's offset can take a different magnitude and either polarity, it is impossible to show all the combinations. Therefore, only representative results are shown to shed light on how slicer offsets affect the PD characteristics.

Fig. 3.37 shows three cases where the slicer levels are increased or decreased by 34% of the maximum vertical eye opening. All three characteristics resemble the ideal case when the phase error is within  $\pm 8$  ps. When the phase error is larger, there are minor differences between these three plots.

Fig. 3.38 shows three cases where the slicer levels are increased or decreased by 135% of the maximum vertical eye opening. Only case "F" resembles the ideal case. On the other hand, although case "D" and case "E" have significantly smaller average output currents when the phase error is large and smaller PD gain when the phase error is around zero, they still resemble the characteristic of a bang-bang PD and their average outputs are still zero when the phase error is around zero.

Since "D", "E", and "F" have significantly different characteristics in Fig. 3.38, it is worth examining how their slicer levels are changed specifically. In case "D", the +2 level is increased, the -2 level is decreased, and the 0 level can be either increased or decreased.



Figure 3.37: PD characteristics with slicer offsets.



Figure 3.38: (continued) PD characteristics with slicer offsets.



Figure 3.39: Transitions detected in case "D".

In case "E", both the +2 and the -2 levels are increased and the 0 level does not matter. In case "F", the +2 level is decreased, the -2 level is increased, and the 0 level does not matter.

To explain the plots of Fig. 3.37 and Fig. 3.38, it is necessary to go back to the charge pump and how data extraction results affect it. The data extraction units are designed to turn off the charge pump when there is no transition. As shown in Fig. 3.33, the  $A_x$  and  $B_x$  signals can only turn both  $M_p$  and  $M_n$  off or leave them under the control of the S and D signals. Hence, the data extraction results cannot by themselves make the charge pump generate an output.

In Fig. 3.37, since the absolute value of the slicer level change is 34% of the maximum vertical eye opening, the changed slicer levels still fall within the eye opening, especially when the phase error is small.

In Fig. 3.38 case "D", applying an increase of 135% of the maximum vertical eye opening to the +2 level makes it larger than the highest differential voltage of the input signal. Therefore, this slicer always produces negative output. On the other hand, applying a decrease of 135% of the maximum vertical eye opening to the -2 level makes it smaller than the lowest differential voltage of the input signal, and this slicer always produces positive output. To sum it up, in case "D", the phase detector only recognizes transitions that cross the new 0-level slicer, which is 135% of the maximum vertical eye opening above the differential 0 as shown in Fig. 3.39(a) or below as in Fig. 3.39(b).

In Fig. 3.38 case "E", the +2 level will be out of the input signal range, but the 0 level and the -2 level are still in range. Therefore, the PD can detect more transitions, as shown in Fig. 3.40, and generate larger output currents.

Figure 3.40: Transitions detected in case "E".

Figure 3.41: Transitions detected in case "F".

In Fig. 3.38 case "F", all three levels fall into the input signal range. Although transitions detected here are the same as those in case "E", the current output is larger since some transitions cross two slicer levels in case "F" (Fig. 3.41) instead of just one in case "E".
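The range argument used for cases "D", "E", and "F" can be checked with simple arithmetic. The sketch below assumes normalized PAM4 levels at ±1 and ±3 (so the nominal slicer levels are -2, 0, and +2) and a hypothetical maximum vertical eye opening of 1.5 in the same units; the thesis does not state these normalized values. Where the text leaves the direction of the 0-level shift open, an increase is chosen arbitrarily. The names `shifted_levels` and `in_range` are hypothetical.

```python
SIGNAL_MIN, SIGNAL_MAX = -3.0, 3.0  # assumed normalized outer PAM4 levels
NOMINAL = {"-2": -2.0, "0": 0.0, "+2": 2.0}
EYE_OPENING = 1.5  # hypothetical maximum vertical eye opening (normalized)

# +1 means the slicer level is increased, -1 means it is decreased.
CASES = {
    "D": {"+2": +1, "0": +1, "-2": -1},
    "E": {"+2": +1, "0": +1, "-2": +1},
    "F": {"+2": -1, "0": +1, "-2": +1},
}


def shifted_levels(directions, fraction, eye=EYE_OPENING):
    """Shift each nominal slicer level by fraction * eye in the given direction."""
    return {name: NOMINAL[name] + directions[name] * fraction * eye
            for name in NOMINAL}


def in_range(levels):
    """A slicer can only toggle if its level lies inside the signal range."""
    return {name: SIGNAL_MIN < value < SIGNAL_MAX
            for name, value in levels.items()}


for case, directions in CASES.items():
    print(case, in_range(shifted_levels(directions, 1.35)))
```

Under these assumptions, only the 0-level slicer remains in range in case "D", the 0-level and -2-level slicers remain in range in case "E", and all three remain in range in case "F", in agreement with the discussion above. A slicer whose level is out of range produces a constant output and therefore never reports a transition.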

The above analysis of Fig. 3.37 and Fig. 3.38 has shown that, as long as the CDR can extract data correctly when locked, the PD characteristics resemble the ideal case regardless of the slicer offsets. Even if the CDR cannot extract data correctly due to large slicer offsets, as long as at least one of the slicer levels falls within the input signal range, the PD characteristics still resemble those of a bang-bang PD with zero output around zero phase error. Hence, the CDR is still able to lock to the input data signal in this extreme scenario.



Figure 3.42: VCO and inductor design.

## 3.7 VCO and Clock Generation

#### 3.7.1 VCO

[31] and [33] have demonstrated that, in designing an LC VCO for this data rate, the primary factor that sets the lower bound on the power consumption is the requirement that the VCO oscillate with a sufficiently large swing.

Fig. 3.42 shows the VCO and its inductor design. The inductor simulation shows it has an inductance of 0.96 nH and a Q factor of 17.8.  $M_1 - M_4$  are all 5 × 500/30 nm. The simulated VCO gain is 2.8 GHz/V.

#### 3.7.2 Clock Generation

Fig. 3.43 shows the clock generation circuits. The differential outputs of the VCO (28 GHz p and 28 GHz n) drive a divide-by-2 circuit. This divide-by-2 circuit generates four 14-GHz clock signals that are 90° apart.



Figure 3.43: Clock generation.

These four 14-GHz clocks drive a ring-style divide-by-2 circuit, which generates eight 7-GHz clock signals that are 45° apart.

The eight 7-GHz clocks drive another ring-style divide-by-2 circuit that generates sixteen 3.5-GHz clock signals that are 22.5° apart.

All clocks except  $S_5 - S_{10}$  in the proposed comparator come from these sixteen 3.5-GHz clocks. The 25% duty-cycle 3.5-GHz clocks for  $S_5 - S_{10}$  are generated by combining a 3.5-GHz clock signal with a 7-GHz clock signal.
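The frequencies and phase spacings of the divider chain, and the 25% duty cycle obtained by combining two clocks, can be reproduced with a few lines of arithmetic. This is a sketch under assumptions: the clocks are ideal 50% duty-cycle square waves, and the "combining" of the 3.5-GHz and 7-GHz clocks is modeled as a logical AND of two edge-aligned clocks, which the text does not specify. The function names are hypothetical.

```python
def divider_chain(f_in_ghz=28.0, phases_in=2, stages=3):
    """Each divide-by-2 stage halves the frequency and doubles the phase count."""
    rows = []
    freq, phases = f_in_ghz, phases_in
    for _ in range(stages):
        freq, phases = freq / 2, phases * 2
        rows.append((freq, phases, 360.0 / phases))
    return rows


def square(t, period):
    """Ideal 50% duty-cycle clock: high during the first half of each period."""
    return (t % period) < period / 2


def and_duty_cycle(samples=1600):
    """Duty cycle of (3.5-GHz clock AND 7-GHz clock) over one 3.5-GHz period.

    Time is normalized so the 3.5-GHz period is 1 and the 7-GHz period is 0.5.
    """
    high = sum(square(k / samples, 1.0) and square(k / samples, 0.5)
               for k in range(samples))
    return high / samples


for freq, phases, spacing in divider_chain():
    print(f"{freq} GHz: {phases} phases, {spacing} degrees apart")
print("duty cycle:", and_duty_cycle())
```

The three stages yield 14 GHz with 4 phases, 7 GHz with 8 phases, and 3.5 GHz with 16 phases, matching the description above, and the AND of the two aligned clocks is high for one quarter of the 3.5-GHz period.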

The latches used in Fig. 3.43 (L1 and L2) are based on the divider latch proposed in [31] and are shown in Fig. 3.44.



Figure 3.44: Clock divider latch.

# CHAPTER 4

# **Experimental Results**

## 4.1 Die Photograph



Figure 4.1: Die photograph.

The prototype is fabricated in a TSMC 28-nm process. Fig. 4.1 shows the die photograph.

## 4.2 Experiment Setup

The experiment uses a Keysight M8040A BERT. A Keysight M8045A pattern generator provides a 56-Gb/s PAM4 signal. A deserialized 7-Gb/s PAM4 signal (generated from one data extraction unit) is sent to a Keysight M8046A error analyzer. The output of the VCO is sent to an HP 8565E for clock spectrum and phase noise measurement. Fig. 4.2 shows the setup.



Figure 4.2: Experiment setup.

## 4.3 Measurement Results

The measured power consumption of the prototype is 8 mW. The phase detector units consume 2.6 mW. The charge pump units consume 0.4 mW. The data extraction units consume 1.1 mW. The VCO consumes 0.9 mW and the clock generation circuits consume 2.8 mW.



Figure 4.3: Jitter transfer.

Fig. 4.3 shows the measured jitter transfer plots. By changing the loop filter and charge pump settings, the loop bandwidth can be varied from 25 MHz to 160 MHz.



Figure 4.4: Jitter tolerance.

Fig. 4.4 shows the measured jitter tolerance. The four plots coincide for jitter frequencies below 1 MHz, and for the 120-MHz and 160-MHz cases the jitter amplitudes also remain unchanged at several frequencies up to 10 MHz. Both effects arise because the applied amplitudes have reached the BERT's limit and cannot be increased further. Therefore, for a jitter amplitude of 1  $UI_{pp}$ , the tolerance frequency is at least 10 MHz for the 120-MHz and 160-MHz bandwidths. For the 25-MHz and 50-MHz bandwidths, it is around 3 to 4 MHz.

Fig. 4.5 shows the recovered clock spectrum for the case of 160-MHz loop bandwidth. The KE5FX software ([34]) is used to capture the screen of the HP 8565E.

The KE5FX software is also able to plot and calculate the phase noise from 100 Hz offset to 100 MHz offset using the HP 8565E, as shown in Fig. 4.6. From 100 MHz offset to 14 GHz offset, samples are read directly from the spectrum analyzer to calculate the RMS clock jitter manually.

The integrated RMS clock jitter from 100 Hz to 14 GHz is 574 fs.
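For reference, the conversion from a phase noise profile to integrated RMS jitter is commonly computed as $\sigma_t = \sqrt{2\int L(f)\,df}\,/\,(2\pi f_0)$, where $L(f)$ is the single-sideband phase noise in linear units and $f_0$ is the carrier frequency. The sketch below applies this standard relation with trapezoidal integration; whether the thesis followed exactly this procedure is not stated, and the flat profile in the example is hypothetical and is not the measured data of Fig. 4.6. The function name `rms_jitter_seconds` is an assumption.

```python
import math


def rms_jitter_seconds(offsets_hz, phase_noise_dbc_hz, carrier_hz):
    """Integrate single-sideband phase noise L(f) in dBc/Hz over the given
    offset frequencies (trapezoidal rule) and convert to RMS jitter in seconds."""
    linear = [10.0 ** (value / 10.0) for value in phase_noise_dbc_hz]
    area = 0.0
    for i in range(1, len(offsets_hz)):
        df = offsets_hz[i] - offsets_hz[i - 1]
        area += 0.5 * (linear[i] + linear[i - 1]) * df
    phase_rms_rad = math.sqrt(2.0 * area)  # factor 2 accounts for both sidebands
    return phase_rms_rad / (2.0 * math.pi * carrier_hz)


# Hypothetical flat profile used only to exercise the function.
offsets = [1e3, 1e4, 1e5, 1e6]
profile = [-120.0, -120.0, -120.0, -120.0]
print(rms_jitter_seconds(offsets, profile, carrier_hz=1e9))
```

For a measured profile that falls steeply with offset, the frequency grid should be dense (for example, many points per decade), since the trapezoidal rule on a coarse grid overestimates the integral.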



Figure 4.5: Recovered clock spectrum.



Figure 4.6: Phase noise from 100 Hz to 100 MHz.

## 4.4 Comparison Table

|                              | Zhang<br>JSSC '20 | Zhao<br>CICC '20 | Kwon<br>TCAS-II '19 | Roshan-Zamir<br>JSSC '19 | Aurangozeb<br>JSSC '18 | This Work |
|------------------------------|-------------------|------------------|------------------|------------------------------|-----------------------------|-----------|
| Data Rate (Gb/s)             | 32                | 29.1             | 32               | 56                           | 28                          | 56        |
| Power (mW)                   | 14.7              | 19.16            | 32               | 49.2*                        | 47*                         | 8         |
| 1 UI Jitter Tol. Freq. (MHz) | 2                 | 1.8              | 1                | 0.5                          | 0.6                         | 10        |
| Loop BW (MHz)                | 10                | 12               | 10               | 10                           | 11                          | 160       |
| CK Jitter (ps)               | 0.352             | 0.487            | 3.8              | N/A                          | 0.513                       | 0.574     |
| Technology (nm)              | 40                | 28               | 28               | 65                           | 65                          | 28        |
| Power Efficiency (pJ/bit)    | 0.46              | 0.66             | 1                | 0.88                         | 1.68                        | 0.14      |

Table 4.1: Performance summary and comparison table

\*Only including CDR portion for fair comparison.

Table 4.1 summarizes the performance and compares it with other state-of-the-art designs. The power consumption and power efficiency are about six times better than those of the state-of-the-art CDR at the same data rate. The power efficiency is also about three times better than the lowest number in this table.

The 1-UI jitter tolerance frequency is twenty times higher than that of the state-of-the-art design at the same data rate and five times higher than the best prior design in the table.
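The ratios quoted above follow directly from the entries of Table 4.1. The short script below recomputes them; the numbers are copied from the table, while the dictionary layout and the function name `pj_per_bit` are illustrative.

```python
# (data rate in Gb/s, power in mW, 1-UI jitter tolerance frequency in MHz)
DESIGNS = {
    "Zhang": (32.0, 14.7, 2.0),
    "Zhao": (29.1, 19.16, 1.8),
    "Kwon": (32.0, 32.0, 1.0),
    "Roshan-Zamir": (56.0, 49.2, 0.5),
    "Aurangozeb": (28.0, 47.0, 0.6),
    "This Work": (56.0, 8.0, 10.0),
}


def pj_per_bit(rate_gbps, power_mw):
    """Power efficiency: mW divided by Gb/s equals pJ/bit."""
    return power_mw / rate_gbps


efficiency = {name: pj_per_bit(rate, power)
              for name, (rate, power, _) in DESIGNS.items()}
this_eff = efficiency["This Work"]
same_rate_eff = efficiency["Roshan-Zamir"]  # the other 56-Gb/s design
best_prior_eff = min(v for k, v in efficiency.items() if k != "This Work")

this_jtol = DESIGNS["This Work"][2]
same_rate_jtol = DESIGNS["Roshan-Zamir"][2]
best_prior_jtol = max(v[2] for k, v in DESIGNS.items() if k != "This Work")

print(f"efficiency of this work: {this_eff:.2f} pJ/bit")
print(f"vs. same data rate: {same_rate_eff / this_eff:.2f}x")
print(f"vs. best prior efficiency: {best_prior_eff / this_eff:.2f}x")
print(f"jitter tolerance vs. same data rate: {this_jtol / same_rate_jtol:.0f}x")
print(f"jitter tolerance vs. best prior: {this_jtol / best_prior_jtol:.0f}x")
```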

The clock jitter of this work, 0.574 ps, is integrated from 100 Hz to 14 GHz. In Zhang JSSC '20 ([10]), it is integrated from 100 kHz to 1 GHz. In Zhao CICC '20 ([11]), it is integrated from 100 Hz to 1 GHz. In Aurangozeb JSSC '18 ([9]), it is integrated from 1 kHz to 1 GHz.

# CHAPTER 5

# **Conclusion**

This work describes a 56-Gb/s PAM4 CDR that has low power consumption and high jitter tolerance. Realized in a 28-nm CMOS process, the prototype consumes 8 mW.

A one-eighth-rate bang-bang PD is proposed. This PD detects phase by calculating Euclidean distances between an edge sample and two neighboring data samples.

The low-power and high-linearity analog front-end of this PD brings large offsets. Therefore, a background offset-cancellation scheme is proposed and a comparator is designed for this scheme. This comparator is able to generate a nearly rail-to-rail output while extracting offset information in the signal path.

Data are extracted by slicers. Operating at one-eighth rate, the slicers can be designed to be low power as well. Simulation results show that the PD characteristic is largely insensitive to typical slicer offsets.

Integrated from 100 Hz to 14 GHz, the recovered clock exhibits an RMS jitter of 0.574 ps. The CDR achieves a loop bandwidth of 160 MHz and tolerates at least 1  $UI_{pp}$  of jitter at a jitter frequency of 10 MHz.

The proposed phase detector can be readily extended to higher-order modulation schemes, such as PAM8. Depending on the data rate and the process technology, it can also be modified to operate at quarter rate or one-sixteenth rate. The proposed offset-cancellation scheme can be readily applied wherever a comparator is needed and its offset must be suppressed, as long as the average input signal value is zero.

#### References

- [1] Cisco, "Cisco Annual Internet Report (2018–2023) White Paper." https://www.cisco.com/c/en/us/solutions/collateral/executive-perspectives/annualinternet-report/white-paper-c11-741490.pdf, March 2020.
- [2] S. Gondi and B. Razavi, "Equalization and Clock and Data Recovery Techniques for 10-Gb/s CMOS Serial-Link Receivers," *IEEE Journal of Solid-State Circuits*, vol. 42, no. 9, pp. 1999–2011, 2007.
- [3] X. Song and D. Dove, "Opportunities for PAM4 Modulation," January 2014.
- [4] Huawei, "50G PAM4 Technical White Paper." https://carrier.huawei.com/~/media/CNBGV2/download/products/networks/50G-PAM4-Technical-White-Paper.pdf.
- [5] LightCounting, "PAM4 DSPs will double the market for IC chipsets used in optical transceivers by 2024." https://www.lightcounting.com/light-trends/pam4-dsps-willdouble-market-ic-chipsets-used-optical-transceivers-2024, March 2020.
- [6] Behzad Razavi, Design of Integrated Circuits for Optical Communications. McGraw-Hill, 2003.
- [7] A. Shehabi et al., "United States Data Center Energy Usage Report," June 2016.
- [8] A. Roshan-Zamir, T. Iwai, Y. Fan, A. Kumar, H. Yang, L. Sledjeski, J. Hamilton, S. Chandramouli, A. Aude, and S. Palermo, "A 56-Gb/s PAM4 Receiver With Low-Overhead Techniques for Threshold and Edge-Based DFE FIR- and IIR-Tap Adaptation in 65-nm CMOS," *IEEE Journal of Solid-State Circuits*, vol. 54, no. 3, pp. 672–684, 2019.
- [9] Aurangozeb, A. D. Hossain, M. Mohammad, and M. Hossain, "Channel-Adaptive ADC and TDC for 28 Gb/s PAM-4 Digital Receiver," *IEEE Journal of Solid-State Circuits*, vol. 53, no. 3, pp. 772–788, 2018.
- [10] Z. Zhang, G. Zhu, C. Wang, L. Wang, and C. P. Yue, "A 32-Gb/s 0.46-pJ/bit PAM4 CDR Using a Quarter-Rate Linear Phase Detector and a Self-Biased PLL-Based Multiphase Clock Generator," *IEEE Journal of Solid-State Circuits*, vol. 55, no. 10, pp. 2734–2746, 2020.
- [11] X. Zhao, Y. Chen, P. Mak, and R. P. Martins, "A 0.0285mm2 0.68pJ/bit Single-Loop Full-Rate Bang-Bang CDR without Reference and Separate Frequency Detector Achieving an 8.2(Gb/s)/µs Acquisition Speed of PAM-4 data in 28nm CMOS," in 2020 IEEE Custom Integrated Circuits Conference (CICC), pp. 1–4, 2020.
- [12] D. Kwon, M. Kim, S. Kim, and W. Choi, "A 32-Gb/s PAM-4 Quarter-Rate Clock and Data Recovery Circuit With an Input Slew-Rate Tolerant Selective Transition Detector," *IEEE Transactions on Circuits and Systems II: Express Briefs*, vol. 66, no. 3, pp. 362–366, 2019.

- [13] Intel, "AN 835: PAM4 Signaling Fundamentals." https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/an/an835.pdf, March 2019.
- [14] J. D. H. Alexander, "Clock recovery from random binary signals," *Electronics Letters*, vol. 11, pp. 541–542, 1975.
- [15] Jri Lee, K. S. Kundert, and B. Razavi, "Analysis and modeling of bang-bang clock and data recovery circuits," *IEEE Journal of Solid-State Circuits*, vol. 39, no. 9, pp. 1571–1580, 2004.
- [16] C. Hogge, "A self correcting clock recovery curcuit," Journal of Lightwave Technology, vol. 3, no. 6, pp. 1312–1314, 1985.
- [17] J. Savoj and B. Razavi, "A 10-Gb/s CMOS clock and data recovery circuit with a half-rate linear phase detector," *IEEE Journal of Solid-State Circuits*, vol. 36, no. 5, pp. 761–768, 2001.
- [18] K. Mueller and M. Muller, "Timing Recovery in Digital Synchronous Data Receivers," *IEEE Transactions on Communications*, vol. 24, no. 5, pp. 516–531, 1976.
- [19] R. Walker, "Designing BangBang PLLs for Clock and Data Recovery in Serial Data Transmission Systems," *Phase-Locking in High-Performance Systems: From Devices to Architectures*, pp. 34–45, 2003.
- [20] Behzad Razavi, Design of analog CMOS integrated circuits. McGraw-Hill, 2002.
- [21] Y. Krupnik, Y. Perelman, I. Levin, Y. Sanhedrai, R. Eitan, A. Khairi, Y. Shifman, Y. Landau, U. Virobnik, N. Dolev, A. Meisler, and A. Cohen, "112-Gb/s PAM4 ADC-Based SERDES Receiver With Resonant AFE for Long-Reach Channels," *IEEE Journal of Solid-State Circuits*, vol. 55, pp. 1077–1085, Apr 2020.
- [22] B. Yoo, D. Lim, H. Pang, J. Lee, S. Baek, N. Kim, D. Choi, Y. Choi, H. Yang, T. Yoon, S. Chu, K. Kim, W. Jung, B. Kim, J. Lee, G. Kang, S. Park, M. Choi, and J. Shin, "6.4 A 56Gb/s 7.7mW/Gb/s PAM-4 Wireline Transceiver in 10nm FinFET Using MM-CDR-Based ADC Timing Skew Control and Low-Power DSP with Approximate Multiplier," in 2020 IEEE International Solid-State Circuits Conference (ISSCC), pp. 122–124, 2020.
- [23] C. Loi, A. Mellati, A. Tan, A. Farhoodfar, A. Tiruvur, B. Helal, B. Killips, F. Rad, J. Riani, J. Pernillo, J. Sun, J. Wong, K. Abdelhalim, K. Gopalakrishnan, K. Kim, L. Tse, M. Davoodi, M. Le, M. Zhang, M. Talegaonkar, P. Prabha, R. Mohanavelu, S. Chong, S. Forey, S. Netto, S. Bhoja, W. Liew, Y. Duan, and Y. Liao, "6.5 A 400Gb/s Transceiver for PAM-4 Optical Direct-Detect Application in 16nm FinFET," in 2019 IEEE International Solid-State Circuits Conference (ISSCC), pp. 120–122, 2019.
- [24] L. Tang, W. Gai, L. Shi, and X. Xiang, "A 40 Gb/s 74.9 mW PAM4 receiver with novel clock and data recovery," in 2017 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1–4, 2017.

- [25] T. Toifl, C. Menolfi, M. Ruegg, R. Reutemann, P. Buchmann, M. Kossel, T. Morf, J. Weiss, and M. Schmatz, "A 22-Gb/s PAM-4 Receiver in 90-nm CMOS SOI Technology," *IEEE Journal of Solid-State Circuits*, vol. 41, pp. 954–965, Apr 2006.
- [26] P.-J. Peng, J.-F. Li, L.-Y. Chen, and J. Lee, "6.1 A 56Gb/s PAM-4/NRZ transceiver in 40nm CMOS," in 2017 IEEE International Solid-State Circuits Conference (ISSCC), pp. 110–111, 2017.
- [27] M. Hossain, Aurangozeb, and N. Nguyen, "DDJ-Adaptive SAR TDC-Based Timing Recovery for Multilevel Signaling," *IEEE Journal of Solid-State Circuits*, vol. 54, pp. 2833–2844, Oct 2019.
- [28] N. Qi, Y. Kang, Q. Lin, J. Ma, J. Shi, B. Yin, C. Liu, R. Bai, S. Hu, J. Wang, J. Du, L. Ma, Z. He, M. Liu, F. Zhang, and P. Y. Chiang, "A 51Gb/s, 320mW, PAM4 CDR with baud-rate sampling for high-speed optical interconnects," in 2017 IEEE Asian Solid-State Circuits Conference (A-SSCC), pp. 89–92, 2017.
- [29] R. Farjad-Rad, C. K. K. Yang, M. A. Horowitz, and T. H. Lee, "A 0.3-µm CMOS 8-Gb/s 4-PAM serial link transceiver," *IEEE Journal of Solid-State Circuits*, vol. 35, no. 5, pp. 757–764, 2000.
- [30] J. Lee and K. Wu, "A 20-Gb/s Full-Rate Linear Clock and Data Recovery Circuit With Automatic Frequency Acquisition," *IEEE Journal of Solid-State Circuits*, vol. 44, no. 12, pp. 3590–3602, 2009.
- [31] J. W. Jung and B. Razavi, "A 25-Gb/s 5-mW CMOS CDR/Deserializer," *IEEE Journal of Solid-State Circuits*, vol. 48, no. 3, pp. 684–697, 2013.
- [32] B. Razavi, "The StrongARM Latch [A Circuit for All Seasons]," IEEE Solid-State Circuits Magazine, vol. 7, no. 2, pp. 12–17, 2015.
- [33] A. Manian and B. Razavi, "A 40-Gb/s 14-mW CMOS Wireline Receiver," *IEEE Journal of Solid-State Circuits*, vol. 52, no. 9, pp. 2407–2421, 2017.
- [34] J. Miles, "KE5FX." http://www.ke5fx.com/.