# UC Irvine Electronic Theses and Dissertations

**Title** Adaptive Far-End Crosstalk Cancellation for MIMO Channels

**Permalink** https://escholarship.org/uc/item/8f06z94k

**Author** Han, Jerry

**Publication Date** 2019

Peer reviewed|Thesis/dissertation

# UNIVERSITY OF CALIFORNIA, IRVINE

Adaptive Far-End Crosstalk Cancellation for MIMO Channels

#### DISSERTATION

submitted in partial satisfaction of the requirements for the degree of

### DOCTOR OF PHILOSOPHY

### in Electrical Engineering and Computer Science

by

Jerry Jifang Han

Dissertation Committee:

Professor Michael M. Green, Chair

Professor Payam Heydari

Professor Filippo Capolino

© 2019 Jerry Jifang Han

# TABLE OF CONTENTS

- LIST OF FIGURES
- LIST OF TABLES
- ACKNOWLEDGMENTS
- CURRICULUM VITAE
- ABSTRACT OF THE DISSERTATION
- CHAPTER 1: Introduction
  - Preface
  - Motivation for Research
  - Organization of Dissertation
- CHAPTER 2: Channel Loss Equalization
  - Equalizers
- CHAPTER 3: Crosstalk
  - Far-End Crosstalk Cancellation
- CHAPTER 4: Adaptive Far-End Crosstalk Cancellation
- CHAPTER 5: Circuit Design, Layout, and Simulation Results
  - Issues Encountered with Low-Voltage BiCMOS Design
  - Output Drivers
  - Clock and Data Recovery Circuit
  - Voltage-Controlled Oscillator (VCO) Design
  - Phase Detector
  - Crosstalk Cancellation Circuit
  - Equalizer
  - Post-Layout Simulation Results
- CHAPTER 6: Test Results
- CHAPTER 7: Conclusion
- REFERENCES

# LIST OF FIGURES

- Figure 1. Top Level of an Electrical-Optical Communications System
- Figure 2. Jitter Decomposition
- Figure 3. Data eye with different non-idealities
- Figure 4. Frequency domain spectrum filtering of data signal
- Figure 5. Eye Diagrams of (a) clean signal and (b) signal after lossy channel
- Figure 6. Effect on total frequency response for (a) ideal and (b) non-ideal equalizers
- Figure 7. Passive CTLE circuit schematic
- Figure 8. Passive CTLE magnitude responses for different capacitance values
- Figure 9. CMOS CTLE circuit schematic
- Figure 10. Active CTLE Magnitude Response
- Figure 11. Feed-Forward Equalizer Block Diagram
- Figure 12. Pre- and Post- Cursor ISI on a filtered pulse
- Figure 13. DFE Block Diagram
- Figure 14. Parallel traces with mutual inductance (left) and capacitance (right)
- Figure 15. Circuit equivalent for a lossless pair of channels with capacitive coupling
- Figure 16. Circuit equivalent for a lossless pair of channels with inductive coupling
- Figure 17. Circuit equivalent for a lossless pair of coupled channels
- Figure 18. RLGC lumped model for 4 port networks
- Figure 19. Magnitude Responses for FEXT and Insertion Loss
- Figure 20. Odd mode coupling illustrating electric fields and magnetic fields
- Figure 21. Even mode coupling illustrating electric fields and magnetic fields
- Figure 22. Waveforms with capacitive (left) and inductive (right) dominated FEXT
- Figure 23. I/O Staggering
- Figure 24. Pre-Distortion
- Figure 25. Direct Cancellation of FEXT at receiver end
- Figure 26. High speed logic to determine transitioning mode from [15]
- Figure 27. Full top level diagram of adaptive XTC as shown in [16]
- Figure 28. Timing diagram of Figure 27
- Figure 29. Comparison between full-rate and half-rate mode signals
- Figure 30. Comparison of full-rate and half-rate adaptation error signals
- Figure 31. Adaptation time system level block diagram
- Figure 32. Typical buffer chain in BiCMOS process
- Figure 33. Cascaded CML buffers with no emitter-follower
- Figure 34. 1" Differential Channel Modeled in HFSS with corresponding S-parameters
- Figure 35. Schematic for 50Gbps output driver stage
- Figure 36. Layout for predriver and output driver
- Figure 37. Transconductance
- Figure 38. Level shifting in VCOs using (a) high-pass filtering and (b) emitter followers
- Figure 39. Initial QVCO schematic
- Figure 40. Final QVCO schematic
- Figure 41. Standard capacitor banks
- Figure 42. Capacitor bank realized using npn transistors instead of MOS switches
- Figure 43. Emitter-follower with loading (a) schematic and (b) equivalent circuit
- Figure 44. 25GHz QVCO and clock driver layout
- Figure 45. VCO frequency bands
- Figure 46. Simulated phase noise of clock and divide-by-two output
- Figure 47. Half-rate binary phase detector
- Figure 48. Standard high-speed latch schematic for BiCMOS process
- Figure 49. D-Latch using AC coupling to level shift
- Figure 50. CML D-Latch
- Figure 51. Pseudo-differential DFF
- Figure 52. Conventional CML XOR
- Figure 53. Symmetric CML XOR gate
- Figure 54. Simulated best- and worst-case phase detector output characteristic
- Figure 55. TSPC divide-by-two
- Figure 56. CMU Divider sensitivity curves
- Figure 57. Original current-mode summation for crosstalk cancellation
- Figure 58. Updated current-mode summation for crosstalk cancellation
- Figure 59. Simplified schematic of CTLE
- Figure 60. Final layout of the chip
- Figure 61. Parallel microstrip lines in HFSS and cross-sectional view as seen in ADS
- Figure 62. S-parameters for expected worst-case channel
- Figure 63. Control voltages for both adaptive equalizer and crosstalk cancellation loops
- Figure 64. CTLE output for worst-case channel
- Figure 65. (a) XTC and clock, and (b) 50Gbps test port outputs for worst-case channel
- Figure 66. Outputs for a 2" channel
- Figure 67. Die photo before wirebonding
- Figure 68. Input 49.12Gbps '11110000' data retimed by half-rate recovered clock
- Figure 69. Divide-by-8 of recovered clock with input 49.12Gbps '11110000' data
- Figure 70. Divide-by-8 of recovered clock with input 49.12Gbps PRBS7 data
- Figure 71. Divide-by-8 of recovered clock with input 49.12Gbps PRBS31 data
- Figure 72. Jitter measurement of demuxed output of 49.38Gbps '11001100' pattern
- Figure 73. Testbench
- Figure 74. 49.38Gbps PRBS7 signal with '1010' crosstalk
- Figure 75. 49.38Gbps PRBS7 signal with '1010' crosstalk after XTC (optimal setting)
- Figure 76. 49.38Gbps PRBS7 signal with '1010' crosstalk after XTC (adaptive setting)
- Figure 77. 49.38Gbps PRBS7 signal with PRBS9 crosstalk
- Figure 78. 49.38Gbps PRBS7 signal with PRBS9 crosstalk (optimal setting)
- Figure 79. 49.38Gbps PRBS7 signal with PRBS9 crosstalk (adaptive setting)

# LIST OF TABLES

- Table 1. Transitioning mode and edge timing for crosstalk dominated by capacitive coupling
- Table 2. Transitioning mode and edge timing for crosstalk dominated by inductive coupling
- Table 3. Operation of Figure 26
- Table 4. Error signal generation for crosstalk dominated by capacitive coupling
- Table 5. Error signal generation for crosstalk dominated by inductive coupling
- Table 6. Modified operation of mode detector
- Table 7. Difference between simulated and theoretical adaptation times
- Table 8. Summary of eye opening for worst-case channel

#### ACKNOWLEDGMENTS

I would like to thank my academic advisor, Professor Michael M. Green, for being an excellent mentor in all aspects, starting from his instruction during my undergraduate studies. He has been extremely patient and has offered constructive advice throughout this doctoral program, which made the completion of this dissertation possible. I would also like to thank the remainder of my committee, Professor Heydari and Professor Capolino, for guiding me during my time at the University of California, Irvine.

I would also like to thank both the Physical Layer Products (PLP) design verification team and the analog/mixed-signal design team at Broadcom Inc., where I interned during my graduate studies, for allowing me to acquire invaluable experience that has been extremely beneficial to my research.

I would like to finally extend my deepest gratitude to my parents Chang-Chaio Han and Shingru Chen, my brother Allen Han, and my wife Fan Wei, for seeing me through this entire process. It is thanks to their moral support and encouragement that I was able to come this far.

#### **CURRICULUM VITAE**

#### Jerry Han

Ph.D. in Electrical Engineering and Computer Science, March 2019
University of California, Irvine
Dissertation: Adaptive Far-End Crosstalk Cancellation for MIMO Channels
Advisor: Dr. Michael M. Green

M.S. in Electrical Engineering and Computer Science, March 2013
University of California, Irvine

B.S. in Electrical Engineering and Computer Science, June 2011
University of California, Irvine

#### FIELD OF STUDY

High-speed analog and mixed-signal circuit design

#### PUBLICATIONS

J. Han and M. Green, "A $2 \times 50$-Gb/s receiver with adaptive channel loss equalization and far-end crosstalk cancellation," 2015 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 2387-2400, May 2015.

#### ABSTRACT OF THE DISSERTATION

Adaptive Far-End Crosstalk Cancellation for MIMO Channels

By

Jerry Jifang Han

Doctor of Philosophy in Electrical Engineering and Computer Science

University of California, Irvine, 2019

Professor Michael M. Green, Chair

Both channel loss and crosstalk present system-level bottlenecks to high-speed wireline transceivers. While there has been extensive research in channel loss equalization, only in recent years has crosstalk cancellation become required in various electrical systems. This dissertation focuses on the physical nature of far-end crosstalk as well as its negative impact on high-speed receivers. A blind adaptive architecture with minimal hardware and complexity overhead is also presented.

These concepts are utilized to design and fabricate a receiver in TowerJazz's SBC18H3 BiCMOS process. The receiver has an operational speed of up to 2 x 49.38Gbps while drawing 187mA from a 1.8V supply. Measurements show that the adaptation is functional and close to the optimal point, and that the eye-width can be improved by up to 270mUI with a PRBS aggressor present.

## **Chapter 1 Introduction**

#### Preface

With broadband communications serving as the backbone for high-speed data transmission, growing demands for faster data-rates while maintaining legacy interfaces and low power impose great challenges on the design of these communications systems. Conventional high-speed interfaces often feature a combination of both electrical and optical systems to achieve the best performance for long-haul communications, as shown in Figure 1. As data rates increase, much greater stress is placed on the electrical domain, and many methods have been realized in an effort to combat the undesirable effects brought about by the nonidealities of channels used in data transmission.



Figure 1. Top Level of an Electrical-Optical Communications System

While it is well understood why a particular physical property can affect a signal qualitatively, it quickly becomes difficult to pinpoint the magnitude of its impact when a variety of nonidealities are introduced together and close the data eye; therefore, an effort has been made to quantify the deterioration of a transmitted data eye by analyzing both its noise (voltage) and jitter (time-domain). For receivers, the inherent noise properties of a signal are generally of far less importance compared to the jitter.

As shown in Figure 2, jitter can be decomposed into several types. Unbounded, random jitter (RJ) is due to noise inherent in both active and passive devices. Bounded jitter can be separated into correlated and uncorrelated components: the former depends on the data transmitted, whereas the latter comes from an external source. Typically, a significant amount of random jitter stems from the transmitter in the system, and data-dependent jitter (DDJ) comes from filtering effects of the channel. Both of these issues can be mitigated to some degree through robust design in the transmitter to limit random jitter and through equalization to compensate for the channel. In single-channel data transmission, bounded-uncorrelated jitter (BUJ) is dominated primarily by periodic jitter on the clock, which arises from power supply spurs, asynchronous clock coupling, and other periodic noise sources in the communications system. The specifications of these particular non-idealities can often be known in advance during design and thus dealt with through proper layout and shielding as well as appropriate use of filtering. However, with the increase in multi-channel systems, crosstalk between data channels becomes an emerging problem, especially with ever increasing data rates.

Figure 2. Jitter Decomposition



Figure 3. Data eye with (a) random noise, (b) periodic and random jitter, (c) channel loss, (d) crosstalk

As shown in Figure 3, random noise, periodic jitter, channel loss, and far-end crosstalk all manifest in different ways to deteriorate the data eye. At higher speeds, the combined effect of all of these sources of jitter and noise will collapse the data eye, making it very difficult for the clock-and-data recovery (CDR) circuit to function appropriately at the receiver end. Efforts made in transmitter design [1,2], particularly in the phase-locked loop (PLL), and in the analog front end of the receiver all serve to relax the retiming requirements imposed on the far-end receiver's CDR.

#### **Motivation for Research**

The majority of wireline telecommunications circuit and system design techniques from the past decade were developed to address single-channel effects. In industry, several adaptive solutions for channel loss equalization also already exist and are widely used today. Crosstalk becomes a pressing issue with denser interconnects carrying higher speed traffic than before. Some works have featured solutions to this problem at both the transmitter and receiver ends, but little has been done to develop an adaptive solution for crosstalk. Adaptation in general allows a single transceiver to converge on an optimal setting even with PVT variations, allowing for a more robust design and better system performance as a whole. While it is possible to make use of digital signal processing (DSP) already present in the electrical system as a brute force solution to the problem, this often requires excess power consumption and hardware overhead that may not be required if a more elegant solution exists. The approach of this research is to dissect the properties of far-end crosstalk and its impact on the data eye and reach an adaptive solution that can be implemented in the analog front-end of a receiver both with and without the aid of a digital back-end. The primary novelty of this work is that not only is the hardware overhead very small compared to single-channel designs, it can also be scaled to further relax speed requirements within the adaptation loop.

#### **Organization of Dissertation**

As this research primarily focuses on cleaning up the data signal after it has passed through a stressful channel, Chapters 2 and 3 address channel loss equalization and crosstalk cancellation, respectively, while discussing architectural differences and tradeoffs. Chapter 4 will cover the crosstalk cancellation adaptation loop, including the thought process leading to the solution as well as a further modification to allow compatibility and make use of already existing hardware on-chip for better integration. Chapter 5 will discuss the circuit design choices and techniques used for the chip, with further details on design methodology and challenges faced to realize this at 50 Gigabit-per-second, particularly in the CDR design. Discussion on the speed requirements of every circuit element will also be included, as well as a comparison between different circuit topologies for both the feedforward and especially feedback paths. Chapter 5 will also go over the post-layout simulation results of the final design. Finally, Chapter 6 will cover measurement results of the returned chip, and Chapter 7 will conclude the dissertation.

## **Chapter 2 Channel Loss Equalization**

One of the major limitations in electrical interfaces for high-speed broadband data transmission is channel loss. Figure 4 illustrates the effect of spectrum filtering on a conventional data signal. The original signal (red) after passing through the frequency response of a lossy channel (pink) loses a lot of spectral content as the frequency increases, as shown in the filtered spectrum (blue). Since different amounts of spectral power are lost at different frequencies, the resulting signal after passing through the channel differs considerably from the original. This effect becomes more pronounced as data rates increase while attempting to use legacy interfaces initially designed for lower bandwidths.



Figure 4. Frequency domain spectrum filtering of data signal



Figure 5. Eye Diagrams of (a) clean signal and (b) signal after lossy channel.

Figure 5 shows the time-domain equivalent of Figure 4, with a 10-Gbps PRBS signal passing through a lossy channel with 10-dB insertion loss at 5 GHz. Both the horizontal and vertical openings of the eye diagram in Figure 5(b) are much smaller than those of the clean signal in Figure 5(a). This occurs because bit sequences with higher transition density (e.g. '10101101') correspond to overall higher frequency content and are unable to reach the desired steady-state amplitude, whereas bit sequences with lower transition densities (e.g. '11110000') have greater low-frequency content and are able to reach steady-state. The change in pulse-width between different bits results in DDJ. Because this change for any particular unit interval is directly related to its adjacent bits, channel-loss induced DDJ is also referred to as inter-symbol interference (ISI).
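The amplitude argument above can be illustrated with a minimal sketch. It assumes a first-order low-pass as a stand-in for the lossy channel, which is a simplification (the channel behind Figure 5 is not first-order), and the time constant `tau_ui`, the oversampling factor, and the function names are illustrative choices rather than values from this work.

```python
import math

def channel_output(bits, tau_ui=1.0, samples_per_ui=32):
    """Pass an NRZ sequence (levels -1/+1) through a first-order low-pass.

    tau_ui is the assumed channel time constant in unit intervals (UI).
    Returns the output voltage sampled at the end of every bit.
    """
    alpha = 1.0 - math.exp(-1.0 / (tau_ui * samples_per_ui))  # per-sample step
    y = 1.0 if bits[0] else -1.0  # start settled at the first bit's level
    end_of_bit = []
    for b in bits:
        target = 1.0 if b else -1.0
        for _ in range(samples_per_ui):
            y += alpha * (target - y)
        end_of_bit.append(y)
    return end_of_bit

def peak_amplitude(pattern, repeats=8, **kw):
    """Largest |output| at the bit boundaries of the last pattern repetition."""
    bits = [int(c) for c in pattern] * repeats
    out = channel_output(bits, **kw)
    tail = out[-len(pattern):]  # start-up transient has decayed by now
    return max(abs(v) for v in tail)

fast = peak_amplitude("10")        # high transition density
slow = peak_amplitude("11110000")  # low transition density
print(f"'10' peak: {fast:.3f} V, '11110000' peak: {slow:.3f} V")
```

With a time constant of one UI, the alternating pattern settles to well under half of the full swing while the '11110000' pattern gets close to its steady-state level, which is the amplitude difference that closes the eye vertically and shifts the zero-crossings.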

#### Equalizers

In order to combat channel loss effects, equalization is necessary. In this section, different equalizer architectures will be introduced along with their respective pros and cons.

The simplest type of equalizer is the continuous-time linear equalizer (CTLE). Ideally these analog equalizers function by trying to undo the channel response, yielding an overall flat response over the frequencies of interest, as shown in Figure 6(a). However, in practice it is infeasible to perfectly mimic the channel response for all frequencies, and due to bandwidth limitations, ultra-high frequency content becomes very difficult to recover as well. In Figure 6(b), the equalized output would have a relatively flat response up to 25 GHz compared to the 10-dB insertion loss from just the channel itself. This will open up the eye sufficiently for data streams up to 50-Gbps.




Figure 6. Effect on total frequency response for (a) ideal and (b) non-ideal equalizers

Figure 7 shows a passive circuit implementation of a CTLE, whose transfer function is given by:

$$\frac{V_{out}}{V_{in}} = \frac{R_2}{R_1 + R_2} \frac{1 + sR_1C_1}{1 + s\frac{R_1R_2}{R_1 + R_2}(C_1 + C_2)}$$
$$p = \frac{1}{\frac{R_1R_2}{R_1 + R_2}(C_1 + C_2)} \qquad z = \frac{1}{R_1C_1}$$

With different resistance and capacitance values, the response of the filter can vary between a low-pass, all-pass, and high-pass response, as shown in Figure 8. For values of $R_1C_1 > R_2C_2$, the zero lies below the pole and the circuit will exhibit a high-pass response, which is desired for this application. Both $R_2$ and $C_2$ can be implemented as variable devices to offer a wide range of tunability in both the DC gain and peaking response of the filter.
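As a quick numerical check of the passive CTLE transfer function above, the following sketch evaluates the zero, the pole, and the gain at the two frequency extremes. The component values are illustrative assumptions, not the values used in this work, and the function names are hypothetical.

```python
import math

def passive_ctle(f_hz, r1, c1, r2, c2):
    """Transfer function of the passive CTLE evaluated at s = j*2*pi*f."""
    s = 2j * math.pi * f_hz
    r_par = r1 * r2 / (r1 + r2)
    return (r2 / (r1 + r2)) * (1 + s * r1 * c1) / (1 + s * r_par * (c1 + c2))

def corner_frequencies(r1, c1, r2, c2):
    """Return (zero, pole) frequencies in Hz."""
    r_par = r1 * r2 / (r1 + r2)
    f_zero = 1.0 / (2 * math.pi * r1 * c1)
    f_pole = 1.0 / (2 * math.pi * r_par * (c1 + c2))
    return f_zero, f_pole

# Assumed example values: R1*C1 = 40 ps, R2*C2 = 5 ps.
R1, C1, R2, C2 = 200.0, 200e-15, 100.0, 50e-15

f_zero, f_pole = corner_frequencies(R1, C1, R2, C2)
dc_gain = abs(passive_ctle(0.0, R1, C1, R2, C2))
hf_gain = abs(passive_ctle(1e15, R1, C1, R2, C2))  # far above the pole

print(f"zero {f_zero / 1e9:.2f} GHz, pole {f_pole / 1e9:.2f} GHz")
print(f"DC gain {dc_gain:.3f}, high-frequency gain {hf_gain:.3f}")
```

With these values the zero (about 4 GHz) falls below the pole (about 9.5 GHz), so the gain rises from $R_2/(R_1+R_2) = 1/3$ at DC to $C_1/(C_1+C_2) = 0.8$ at high frequency. Both limits stay below unity, as expected for a passive network.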



Figure 7. Passive CTLE circuit schematic



Figure 8. Passive CTLE magnitude responses for different capacitance values

Because this is a passive network, the maximum achievable gain is unity, and often much less in practical equalizer implementations; active circuits are therefore typically more attractive. An example of a CMOS implementation of a CTLE is shown in Figure 9, with a small-signal transfer function given by:

$$\left|\frac{V_{out}}{V_{in}}\right| = \frac{g_m R_D}{1 + g_m R_S} \frac{1 + sR_S C_S}{\left(1 + s\frac{R_S C_S}{1 + g_m R_S}\right)(1 + sR_D C_D)}$$
$$p_1 = \frac{1}{R_D C_D}, \qquad p_2 = (1 + g_m R_S)\frac{1}{R_S C_S} \qquad z = \frac{1}{R_S C_S}$$



Figure 9. CMOS CTLE circuit schematic

Like the passive CTLE, the active version can also be tuned by making $R_S$ and $C_S$ variable components, as shown in Figure 10. Increasing $R_S$ allows peaking at a slightly higher frequency at the expense of DC gain as compared to increasing $C_S$, and can give much higher peaking at high frequencies as long as the system can tolerate the decrease in signal amplitude. In practice, a combination of tuning both parameters is commonly used.
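The tuning trade-off can be explored with a short sketch that evaluates the active CTLE transfer function given above over a logarithmic frequency sweep. The device values ($g_m$, $R_D$, $C_D$, and the $R_S$/$C_S$ settings) are assumptions chosen only for illustration, and `peaking_db` is a hypothetical helper.

```python
import math

def active_ctle(f_hz, gm, r_d, c_d, r_s, c_s):
    """Degenerated-stage transfer function evaluated at s = j*2*pi*f."""
    s = 2j * math.pi * f_hz
    dc_gain = gm * r_d / (1 + gm * r_s)
    zero = 1 + s * r_s * c_s
    pole_2 = 1 + s * r_s * c_s / (1 + gm * r_s)
    pole_1 = 1 + s * r_d * c_d
    return dc_gain * zero / (pole_2 * pole_1)

def peaking_db(gm, r_d, c_d, r_s, c_s):
    """Return (DC gain, peak gain, peaking in dB) over 100 MHz to 1 THz."""
    freqs = [10 ** (k / 20) for k in range(160, 241)]
    dc = abs(active_ctle(0.0, gm, r_d, c_d, r_s, c_s))
    peak = max(abs(active_ctle(f, gm, r_d, c_d, r_s, c_s)) for f in freqs)
    return dc, peak, 20 * math.log10(peak / dc)

# Assumed example values, not taken from the fabricated design.
GM, R_D, C_D = 20e-3, 100.0, 20e-15

for r_s, c_s in [(200.0, 100e-15), (400.0, 100e-15), (200.0, 200e-15)]:
    dc, peak, boost = peaking_db(GM, R_D, C_D, r_s, c_s)
    print(f"RS={r_s:.0f} ohm, CS={c_s * 1e15:.0f} fF: "
          f"DC {dc:.3f}, peak {peak:.3f}, peaking {boost:.1f} dB")
```

In this sketch, doubling $R_S$ lowers the DC gain from 0.4 to about 0.22 and increases the available peaking, whereas doubling $C_S$ leaves the DC gain unchanged and only moves the zero to a lower frequency. The peak gain always stays below $g_m R_D$.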



Figure 10. Active CTLE Magnitude Response

There are many different ways [3, 4] to realize a peaking filter aside from the aforementioned circuits. Due to their simplicity and relatively small footprint on the die, CTLEs are often the preferred equalization method when possible. However, CTLEs are only effective when the channel response is relatively simple. For more complicated scenarios, such as non-monotonic insertion loss, excessive reflection, or resonant frequencies, it becomes difficult for CTLEs to properly reverse the channel response without increasingly complicated filter implementations. In instances where a large amount of insertion loss is present, a CTLE also becomes unsuitable due to the amount of peaking required.

The second type of equalizer is the feed-forward equalizer (FFE), which is composed of multiple delay stages and a summing cell, as shown in Figure 11. The FFE opens the eye by trying to cancel out both pre- and post-cursor ISI as defined in Figure 12. By adjusting the tap weights $a_0 \ldots a_k$ individually, an optimal setting for the FFE that removes the largest amount of ISI can be found. In the frequency domain, adjustment of tap weights also corresponds to a change in the frequency response [5].
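One common way to pick the tap weights is a zero-forcing solve, sketched here under stated assumptions: the symbol-spaced pulse response `PULSE` is invented for illustration, the FFE has one pre-cursor, one main, and one post-cursor tap, and the small Gaussian-elimination helper only avoids an external dependency. This is not necessarily how the taps are chosen in the cited works.

```python
def convolve(x, h):
    """Full discrete convolution of two short sequences."""
    out = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            out[i + j] += xi * hj
    return out

def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        acc = aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = acc / aug[r][r]
    return x

def zero_forcing_taps(pulse, main_idx, n_pre, n_post):
    """Taps that force the equalized pulse to 1 at the cursor and 0 nearby."""
    n_taps = n_pre + 1 + n_post

    def h(k):
        return pulse[k] if 0 <= k < len(pulse) else 0.0

    # Row i constrains equalized sample n = main_idx + i.
    rows = [[h(main_idx + i - j) for j in range(n_taps)] for i in range(n_taps)]
    target = [1.0 if i == n_pre else 0.0 for i in range(n_taps)]
    return solve(rows, target)

PULSE = [0.1, 1.0, 0.4, 0.15]  # assumed: pre-cursor, main, two post-cursors
MAIN = 1

taps = zero_forcing_taps(PULSE, MAIN, n_pre=1, n_post=1)
equalized = convolve(PULSE, taps)
cursor = MAIN + 1  # main cursor position after the one-tap pre-cursor delay

raw_isi = sum(abs(v) for i, v in enumerate(PULSE) if i != MAIN)
residual_isi = sum(abs(v) for i, v in enumerate(equalized) if i != cursor)
print("taps:", [round(a, 4) for a in taps])
print(f"ISI before: {raw_isi:.3f}, after: {residual_isi:.3f}")
```

A three-tap FFE can only null the samples adjacent to the cursor, so some residual ISI remains further out; in this example the sum of ISI magnitudes still drops substantially.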



Figure 11. Feed-Forward Equalizer Block Diagram



Figure 12. Pre- and Post- Cursor ISI on a filtered pulse



Figure 13. DFE Block Diagram

The third type of equalizer, shown in Figure 13, is the decision-feedback equalizer (DFE), which is also composed of multiple delays and a summation. The DFE, however, employs a feedback loop and a dedicated slicer to digitize the signal. Due to the nature of the DFE's operation, it is only able to cancel post-cursor ISI since, unlike the FFE, all weighted terms in the sum are delayed versions of previously decided bits. Additionally, there is often a strict timing requirement imposed on the slicer for proper operation in the feedback loop, since in practice the delay cells are realized as DFFs with timing margin requirements [6]. This sets an upper limit on the maximum speed of a DFE that can be realized in a given technology.

Due to the inherent digitization of the signal, utilization of a DFE has the benefit of not amplifying noise and crosstalk present on the signal at the front end of the receiver, unlike a CTLE or an FFE. However, for scenarios where the data eye is completely closed, it becomes difficult for a DFE to make a decision at the slicer, which can result in error propagation; therefore, an analog equalizer is often required to first open the eye by some amount. For complicated systems, a combination of all three types of equalizers is often used for high-speed transceivers in order to achieve optimal performance.
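The post-cursor cancellation performed by a DFE can be sketched behaviourally. The example assumes an idealized symbol-spaced channel with two post-cursors, a noiseless signal, and feedback taps set exactly equal to those post-cursors; the pulse values and function names are illustrative only.

```python
def prbs7(n_bits, seed=0x7F):
    """PRBS7 sequence (x^7 + x^6 + 1) mapped to NRZ levels -1/+1."""
    state, out = seed, []
    for _ in range(n_bits):
        new_bit = ((state >> 6) ^ (state >> 5)) & 1
        state = ((state << 1) | new_bit) & 0x7F
        out.append(1.0 if new_bit else -1.0)
    return out

def channel(bits, pulse):
    """Symbol-spaced channel: pulse[0] is the main cursor, the rest post-cursors."""
    rx = []
    for n in range(len(bits)):
        rx.append(sum(pulse[k] * bits[n - k]
                      for k in range(len(pulse)) if n - k >= 0))
    return rx

def dfe(rx, taps):
    """Subtract weighted past decisions, then slice.

    Returns (slicer inputs, decisions).
    """
    decisions, slicer_in = [], []
    for n, sample in enumerate(rx):
        feedback = sum(w * decisions[n - 1 - k]
                       for k, w in enumerate(taps) if n - 1 - k >= 0)
        v = sample - feedback
        slicer_in.append(v)
        decisions.append(1.0 if v >= 0 else -1.0)
    return slicer_in, decisions

PULSE = [1.0, 0.5, 0.2]  # assumed main cursor and two post-cursors

bits = prbs7(254)
rx = channel(bits, PULSE)
slicer_in, decisions = dfe(rx, PULSE[1:])  # ideal taps match the post-cursors

eye_without = min(abs(v) for v in rx)        # worst-case level, no DFE
eye_with = min(abs(v) for v in slicer_in)    # worst-case level at the slicer
print(f"worst-case amplitude without DFE: {eye_without:.2f}")
print(f"worst-case amplitude with DFE:    {eye_with:.2f}")
```

Because the feedback uses already-sliced bits, the correction adds no noise of its own, but a single wrong decision would corrupt the following slicer inputs. That is the error propagation mentioned above, and it does not occur in this noiseless example.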

## **Chapter 3 Crosstalk**

In recent years, crosstalk has become a more apparent problem as dense parallel traffic becomes necessary even for ultra-high data rates. Crosstalk is the coupling of one propagating signal onto another signal via mutual inductance or capacitance as illustrated in Figure 14. The focus of this research is on far-end crosstalk (FEXT) cancellation.



Figure 14. Parallel traces with mutual inductance (left) and capacitance (right)

In order to remove FEXT, it is important to understand how it manifests itself onto the transmitted signal. As crosstalk comes from both capacitive and inductive coupling between channels, applying basic circuit theory allows for a qualitative understanding.

From the circuit in Figure 15, we can derive the following expressions for capacitive coupling as

$$I_{FEXT} = C_m \frac{d(V_{L,aggressor} - V_{L,victim})}{dt}$$

$$I_{Load} = I_{victim} + I_{FEXT} = I_{victim} + C_m \frac{d(V_{L,aggressor} - V_{L,victim})}{dt}$$



Figure 15. Circuit equivalent for a lossless pair of channels with capacitive coupling

Likewise, from the circuit in Figure 16, we can derive the following expressions for inductive coupling as

$$V_{FEXT} = L_m \frac{dI_{aggressor}}{dt}$$

$$V_{Load} = V_{S,victim} - V_{FEXT} = V_{S,victim} - L_m \frac{dI_{aggressor}}{dt}$$



Figure 16. Circuit equivalent for a lossless pair of channels with inductive coupling

If we were to consider both inductive and capacitive coupling simultaneously, as is the case in reality, we arrive at the circuit in Figure 17 and obtain the expressions

$$\begin{aligned}
V_{L,victim} &= V_{S,victim} - V_{FEXT} + AZ_{Load}I_{FEXT} \\
&= V_{S,victim} - L_m \frac{dI_{aggressor}}{dt} + AZ_{Load}C_m \frac{dV_{L,aggressor}}{dt}
\end{aligned} \tag{1}$$

where $A$ is a scalar dependent on the intrinsic capacitance $C_p$ and the terminating resistor $Z_{Load}$.



Figure 17. Circuit equivalent for a lossless pair of coupled channels

Based on these expressions, it is apparent that FEXT is proportional to the derivative of the aggressor signal. It can also be observed from the difference in sign of the two coupling terms on the right-hand side of (1) that, given the appropriate values of $L_m$ and $C_m$, the disturbances caused by the mutual inductive and capacitive coupling can effectively cancel each other out. Unfortunately, for a large majority of board designs, such an arrangement is not practical or even possible, and the crosstalk is dominated by either mutual inductance or capacitance.
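The sign behaviour of the two coupling terms in (1) can be illustrated numerically. The sketch below makes several assumptions that are not part of the derivation above: the aggressor edge is a tanh-shaped 1 V step, the aggressor current is taken as $I_{aggressor} = V_{L,aggressor}/Z_0$ (a matched lossless line), $A = 1$, and the coupling values are arbitrary.

```python
import math

def aggressor_edge(t, t_rise):
    """Smooth 0 V to 1 V transition centred at t = 0 (assumed tanh shape)."""
    return 0.5 * (1.0 + math.tanh(2.0 * t / t_rise))

def fext_waveform(times, l_m, c_m, z0=50.0, z_load=50.0, a=1.0, t_rise=10e-12):
    """Coupling terms of (1), assuming I_aggressor = V_aggressor / z0."""
    dt = times[1] - times[0]
    out = []
    for t in times:
        # Central-difference estimate of the aggressor slope.
        dv_dt = (aggressor_edge(t + dt / 2, t_rise)
                 - aggressor_edge(t - dt / 2, t_rise)) / dt
        di_dt = dv_dt / z0
        out.append(-l_m * di_dt + a * z_load * c_m * dv_dt)
    return out

times = [k * 1e-12 for k in range(-50, 51)]  # -50 ps to +50 ps
C_M = 20e-15                                 # assumed mutual capacitance

capacitive = fext_waveform(times, l_m=0.0, c_m=C_M)
inductive = fext_waveform(times, l_m=100e-12, c_m=0.0)
balanced = fext_waveform(times, l_m=1.0 * 50.0 * 50.0 * C_M, c_m=C_M)

print(f"capacitive peak: {max(capacitive) * 1e3:.1f} mV")
print(f"inductive peak:  {min(inductive) * 1e3:.1f} mV")
print(f"balanced peak:   {max(abs(v) for v in balanced) * 1e3:.3f} mV")
```

Under these assumptions the disturbance is a pulse centred on the aggressor transition, positive when capacitive coupling dominates and negative when inductive coupling dominates, and it vanishes when $L_m = A Z_{Load} Z_0 C_m$.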

For realistic channels, the lumped model of these channels includes lossy parameters, and the channels are often modeled using an RLGC table of parameters, with the form

$$\begin{bmatrix}
L_{11} & L_{21} & L_{22} \\
R_{11} & R_{21} & R_{22} \\
C_{11} & C_{21} & C_{22} \\
G_{11} & G_{21} & G_{22}
\end{bmatrix}$$

corresponding to the lumped model in Figure 18. For practical use, these physical parameters are often first converted into impedances and then into S-parameters for more intuitive understanding of the channel characteristics over broadband frequencies.



Figure 18. RLGC lumped model for 4 port networks

Using physical parameters for a real pair of microstrip lines, and converting the generated RLGC matrices into S-parameters over the frequencies of interest, the plotted magnitude responses for both the insertion loss and crosstalk can be seen in Figure 19. The high-pass characteristic observed for the FEXT response is consistent with the derivations obtained from the lossless models.

Because FEXT is proportional to the derivative, or high frequency content, of the aggressor signal, the excitation caused on the victim signal is largest when coincident with the transitions of the aggressor. For synchronous signals that are transmitted, as is the case in many dense parallel links, the effect is then largest at the zero-crossings of the signals, resulting in substantially larger jitter.



Figure 19. Magnitude Responses for FEXT and Insertion Loss

To understand how FEXT impacts the timing edge, it is instructive to look at the waveforms for the two propagating modes that do result in jitter. During the odd mode, where the two signals are exhibiting opposite transitions, the respective electric and magnetic fields can be seen in Figure 20. From these diagrams, the expressions for the net current and voltage can be derived as

$$I_{C,odd} = C_p \frac{dV}{dt} + C_m \frac{d(V - (-V))}{dt} = (C_p + 2C_m) \frac{dV}{dt}$$
$$V_{L,odd} = L_s \frac{dI}{dt} + L_m \frac{d(-I)}{dt} = (L_s - L_m) \frac{dI}{dt}$$

For the odd mode, the effective capacitance increases while the inductance decreases. The expressions for even mode propagation can be derived in a similar way, based on Figure 21.

$$I_{C,even} = C_p \frac{dV}{dt} + C_m \frac{d(V-V)}{dt} = C_p \frac{dV}{dt}$$
$$V_{L,even} = L_s \frac{dI}{dt} + L_m \frac{dI}{dt} = (L_s + L_m) \frac{dI}{dt}$$




Figure 20. Odd mode coupling illustrating (a) electric fields and (b) magnetic fields



Figure 21. Even mode coupling illustrating (a) electric fields and (b) magnetic fields

The changes in the net capacitance and inductance between the even, odd, and superposition modes ultimately result in three different timings for the transitioning edge; this effect is known as crosstalk-induced jitter (CIJ).
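As a rough numerical illustration of these expressions, the sketch below computes the per-unit-length propagation delay $\sqrt{L_{eff}C_{eff}}$ for each mode. The element values are hypothetical and chosen only to make the ordering visible. The superposition-mode expressions ($L_s$ and $C_p + C_m$) are an assumption obtained by holding the neighboring line static in the same lumped model; they are not taken from the derivation above.

```python
import math

def mode_delays(l_s, l_m, c_p, c_m):
    """Per-unit-length delay sqrt(L_eff * C_eff) for each transitioning mode."""
    return {
        "odd": math.sqrt((l_s - l_m) * (c_p + 2 * c_m)),
        # Assumption: neighbor static, so only C_m to a fixed node is added.
        "superposition": math.sqrt(l_s * (c_p + c_m)),
        "even": math.sqrt((l_s + l_m) * c_p),
    }

# Hypothetical per-unit-length values (H/m and F/m), for illustration only.
L_S, C_P = 300e-9, 100e-12
capacitive = mode_delays(L_S, l_m=5e-9, c_p=C_P, c_m=20e-12)
inductive = mode_delays(L_S, l_m=60e-9, c_p=C_P, c_m=2e-12)

for name, delays in (("capacitive", capacitive), ("inductive", inductive)):
    ordered = sorted(delays, key=delays.get)  # earliest edge first
    print(name, "dominated, earliest to latest:", ordered)
```

With these values the even mode arrives first when the mutual capacitance dominates, and the odd mode arrives first when the mutual inductance dominates.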

Observations made on a transient waveform can also be used to verify the aforementioned derivations, as shown in Figure 22. As expected, for the case where the mutual capacitance is the dominant cause of FEXT, the transition timing is late for the odd mode, and early for the even mode. The opposite holds true for the case where the mutual inductance is the dominant cause of FEXT.

Figure 22. Waveforms with capacitive (left) and inductive (right) dominated FEXT

# **FEXT Cancellation**

One method to suppress the effect of FEXT, shown in Figure 23, is called "I/O Staggering", where signals adjacent to one another are sent in quadrature phase relative to one another [7]. This results in a large majority of the crosstalk energy showing up as amplitude distortion only, and greatly reduces the crosstalk-induced jitter. While this is perhaps the easiest way to mitigate the effect of crosstalk, it is not always possible because many high-speed interfaces are required to be synchronous. Also, for transmitters triggering on sub-rate clocks, multiphase clocks are necessary to generate the required phase offset, and these may not always be available.



Figure 23. I/O Staggering

Another method of eliminating FEXT, shown in Figure 24, is called pre-distortion [8,9], where the signal at the transmitter end is purposely distorted such that, after passing through the channel, the signal comes out with little crosstalk. This is similar in concept to the pre-emphasis done by an FFE on conventional transmitter outputs to help compensate for channel loss and relax requirements on the receiver end. The primary challenge with this approach is that without any feedback mechanism, there is no way to determine the optimal setting for the XTC circuit.



Figure 24. Pre-Distortion

At the receiver end, the only option to cancel FEXT is through direct cancellation [10,11], as shown in Figure 25. This method attempts to remove crosstalk by either adding or subtracting the derivative of the aggressor signal from the victim signal. While theoretically very similar to the pre-distortion method at the transmitter, direct cancellation at the receiver is much more difficult, particularly at higher speeds. At the transmitter, signals are well conditioned, and the bit sequence of the aggressor is also known. However, at the receiver end, the combination of channel loss and crosstalk results in distorted signals that make accurate generation of the derivative signal very challenging. Despite these challenges, it is still worthwhile to perform XTC at the receiver end because blind adaptation is possible.



Figure 25. Direct Cancellation of FEXT at receiver end
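The principle of direct cancellation can be sketched in a few lines. This is an idealized illustration only: the function names, the sample values, and the coupling weight `k` are hypothetical, and the aggressor waveform is assumed to be available undistorted, which is precisely what is difficult at a real receiver.

```python
def derivative(samples, dt):
    """Backward-difference derivative; the first sample has no history, so 0 is used."""
    return [0.0] + [(b - a) / dt for a, b in zip(samples, samples[1:])]

def add_fext(victim, aggressor, k, dt):
    """Model FEXT as k times the aggressor derivative added to the victim."""
    return [v + k * d for v, d in zip(victim, derivative(aggressor, dt))]

def cancel_fext(received, aggressor, alpha, dt):
    """Subtract a crosstalk replica with weight alpha from the received victim."""
    return [r - alpha * d for r, d in zip(received, derivative(aggressor, dt))]

# Hypothetical waveforms: a static victim and an aggressor with one rising edge.
DT = 1.0
victim = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
aggressor = [0.0, 0.0, 0.5, 1.0, 1.0, 1.0]
K = -0.2  # hypothetical coupling weight; its sign depends on the coupling type

received = add_fext(victim, aggressor, K, DT)
restored = cancel_fext(received, aggressor, alpha=K, dt=DT)
print("received:", received)
print("restored:", restored)
```

When `alpha` matches `k`, including its sign, the victim is recovered; a mismatch in either magnitude or polarity leaves residual crosstalk.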

It is important to understand that while a DFE is an effective solution for channel loss as well as dispersion, it is ineffective in reducing crosstalk. DFEs rely on delayed versions of the signal under stress to function properly, whereas eliminating crosstalk requires information from the aggressor signal.

In the next chapter, the adaptation scheme for crosstalk cancellation will be presented. Because the origin of crosstalk is fundamentally different from that of channel loss, the method will require distilling the error signal from both differential signals on the chip, rather than just one.

# **Chapter 4** Adaptive Crosstalk Cancellation

Several works in recent years have realized crosstalk cancellation with an adaptive loop. Not only does adaptation allow the circuit to operate at the optimal setting automatically, it also enables more robust operation across PVT.

Adaptation at the transmitter end always requires some handshake sequence together with some training sequence to be effective. At the receiver end, two methods of training are popular. The first requires a training sequence such as "1010" or "1100" [8], and the second shuts off the victim path and directly measures the power of the residual crosstalk signal [12]. Finally, blind adaptation techniques reported thus far still require knowledge about the type of channel coupling (i.e., either capacitive or inductive) [13,14]. In this study, a blind adaptation scheme is proposed that requires prior knowledge of neither the channel nor the type of coupling, as the adaptation algorithm itself determines the coupling coefficient and its sign.

In order for the adaptation loop to converge to the correct setting, an accurate error signal is needed. Since crosstalk is composed of only high-frequency broadband content, it is difficult to discern just the excess spectral content due to crosstalk in the presence of the PRBS signal's own power spectrum. Thus, frequency-dependent distilling of the magnitude of the crosstalk still present on the data signal without a training sequence is not a valid solution. Even if this magnitude could be determined, unlike adaptation schemes for channel loss equalization, an adaptive crosstalk cancellation scheme's error signal must contain not only information regarding the magnitude of the crosstalk, but also the polarity.

| Transitioning Mode | Effect on Edge Timing |
|--------------------|-----------------------|
| Even               | Early                 |
| Superposition      | No Change             |
| Odd                | Late                  |

Table 1. Transitioning mode and edge timing for crosstalk dominated by capacitive coupling

| Transitioning Mode | Effect on Edge Timing |
|--------------------|-----------------------|
| Even               | Late                  |
| Superposition      | No Change             |
| Odd                | Early                 |

Table 2. Transitioning mode and edge timing for crosstalk dominated by inductive coupling

It can be seen from Tables 1 and 2 that for any given XT coefficient, there is a correlation between the transitioning mode and the phase shift on the victim signal. In order to obtain full information about the crosstalk present, circuits that perform both mode detection and phase detection are required.

To perform mode detection, it is important to accurately determine the relationship between the transitions of the two propagating signals. In order to determine the exact mode at any given time, high-speed logic operating at the full-rate speed is required [15], as shown in Figure 26. The operation of Figure 26 is given in Table 3 to cover all states.



Figure 26. High speed logic to determine transitioning mode from [15]

| $D_0(t_0)$ | $D_0(t_0+T_b)$ | $D_1(t_0)$ | $D_1(t_0+T_b)$ | Propagating<br>Mode | Odd Mode<br>Signal | Even Mode<br>Signal |
|----------------------------------|------------------|----------------------------------|----------------|---------------------|--------------------|---------------------|
| 0                                | 0                | 0                                | 1              | Superposition       | 0                  | 0                   |
| 0                                | 1                | 0                                | 1              | Even                | 0                  | 1                   |
| 1                                | 0                | 0                                | 1              | Odd                 | 1                  | 0                   |
| 1                                | 1                | 0                                | 1              | Superposition       | 0                  | 0                   |
| 0                                | 0                | 1                                | 0              | Superposition       | 0                  | 0                   |
| 0                                | 1                | 1                                | 0              | Odd                 | 1                  | 0                   |
| 1                                | 0                | 1                                | 0              | Even                | 0                  | 1                   |
| 1                                | 1                | 1                                | 0              | Superposition       | 0                  | 0                   |

Table 3. Operation of Figure 26
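The truth table can be restated behaviorally as follows. This is a sketch of the logic function only, not of the delay-cell circuit in Figure 26; the function name is hypothetical, and returning zeros for combinations in which $D_1$ does not transition (not listed in Table 3) is an assumption.

```python
def detect_mode(d0_now, d0_next, d1_now, d1_next):
    """Return (mode, odd_signal, even_signal) for one bit period."""
    t0 = d0_next - d0_now  # +1 rising, -1 falling, 0 no transition
    t1 = d1_next - d1_now
    if t0 != 0 and t0 == t1:
        return ("Even", 0, 1)   # both signals transition in the same direction
    if t0 != 0 and t0 == -t1:
        return ("Odd", 1, 0)    # the signals transition in opposite directions
    return ("Superposition", 0, 0)  # at most one signal transitions

def mode_signal(odd_signal, even_signal):
    """Tri-state mode signal formed by subtracting the odd from the even signal."""
    return even_signal - odd_signal

for bits in [(0, 1, 0, 1), (1, 0, 0, 1), (0, 0, 0, 1)]:
    mode, odd, even = detect_mode(*bits)
    print(bits, mode, mode_signal(odd, even))
```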

Aside from the delay cells, the combinational logic elements in the Figure 26 circuit schematic can be realized using high-speed current-mode logic (CML). Relatively tight control of the unit delay cells is required for proper operation: the delay must stay within $\pm 25\%$ of the unit interval. While this margin may seem generous, at very high speeds (e.g., 50 Gbps) the delay can easily vary by more than the requisite range across PVT, and an extra control loop is likely required to ensure proper operation.

While the output of the mode detector is two digital signals, the actual mode signal useful for performing adaptation is a tri-state signal achieved by subtracting the even and odd mode signals via current summation.

If the delay cell is controlled to have a delay close to a unit interval, the Figure 26 circuit can also perform phase detection at the output of the XOR gates. However, as this is not trivial to realize, a separate means of phase detection is required. On the receiver end, there is already a CDR present composed of a phase detector, integrator, and clock generating circuit. Since the phase detector from the CDR is already present on chip, the phase detector output containing phase information on each edge can be "recycled" and used in both the CDR and adaptive XTC loops.

By multiplying the mode signal and phase detector outputs, the resulting error signal will have the proper polarity, as shown in Tables 4 and 5. The error signal generated for each case has the correct polarity to ensure correct operation for adaptation.

| Transitioning<br>Mode | Effect on<br>Edge Timing | Mode Signal | CDR Phase<br>Detector Output | Error Signal |
|-----------------------|--------------------------|-------------|------------------------------|--------------|
| Even                  | Early                    | 1           | -1                           | -1           |
| Superposition         | No Change                | 0           | 0                            | 0            |
| Odd                   | Late                     | -1          | 1                            | -1           |

Table 4. Error signal generation for crosstalk dominated by capacitive coupling

| Transitioning<br>Mode | Effect on<br>Edge Timing | Mode Signal | CDR Phase<br>Detector Output | Error Signal |
|-----------------------|--------------------------|-------------|------------------------------|--------------|
| Even                  | Late                     | 1           | 1                            | 1            |
| Superposition         | No Change                | 0           | 0                            | 0            |
| Odd                   | Early                    | -1          | -1                           | 1            |

Table 5. Error signal generation for crosstalk dominated by inductive coupling
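The two tables can be reproduced by a direct multiplication, which makes it easy to see that the error polarity depends only on the coupling type and not on the transitioning mode. The dictionary names below are hypothetical, and the numeric encodings are those used in the tables.

```python
MODE_SIGNAL = {"Even": 1, "Superposition": 0, "Odd": -1}
PHASE_OUTPUT = {"Early": -1, "No Change": 0, "Late": 1}

# Edge timing per transitioning mode, as listed in Tables 1 and 2.
EDGE_TIMING = {
    "capacitive": {"Even": "Early", "Superposition": "No Change", "Odd": "Late"},
    "inductive": {"Even": "Late", "Superposition": "No Change", "Odd": "Early"},
}

def error_signals(coupling):
    """Error signal per mode: mode signal times phase detector output."""
    return {
        mode: MODE_SIGNAL[mode] * PHASE_OUTPUT[timing]
        for mode, timing in EDGE_TIMING[coupling].items()
    }

for coupling in EDGE_TIMING:
    print(coupling, error_signals(coupling))
```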

For proper operation of a phase detector, the output should be zero for cases where no transitions occur; that is, only when transitions are occurring will there be any meaningful information about the phase. If the XTC error signal contains phase information, this implies then that the error signal will be zero for the superposition mode. Only for the even and odd modes will there be meaningful phase information.

The proposed adaptation scheme takes advantage of the phase detector in the CDR already on chip to perform both phase and mode detection. Since the output of this phase detector is on average zero for the superposition mode, the actual state of the mode signal during these transitioning periods is irrelevant, and the tri-state signal can be simplified to the following truth table.

| $D_0(t_0)$ | $D_0(t_0+T_b)$ | $D_1(t_0)$ | $D_1(t_0+T_b)$ | Propagating<br>Mode | Odd Mode<br>Signal | Even Mode<br>Signal |
|------------|----------------|------------|----------------|---------------------|--------------------|---------------------|
| 0          | 0              | 0          | 1              | Superposition       | X                  | X                   |
| 0          | 1              | 0          | 1              | Even                | 0                  | 1                   |
| 1          | 0              | 0          | 1              | Odd                 | 1                  | 0                   |
| 1          | 1              | 0          | 1              | Superposition       | X                  | X                   |

Table 6. Modified operation of mode detector

It can easily be seen from Table 6 that a single XOR gate can accomplish the same necessary function as the complex high-speed logic from Figure 26, which simplifies the mode detection operation to a single gate. An implementation of this adaptation was published in [16] and is shown in Figure 27. Although the CDR operates at half-rate, the mode detection still operates at the full rate of 50 Gbps, which places strict requirements on the XOR gate required for mode detection.
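The equivalence can be checked exhaustively. In the sketch below, the reference classification uses the transition directions of both signals, while the simplified detector is a single XOR of the two present data bits; taking the XOR inputs to be the two data signals themselves is an assumption consistent with Table 6, and the function names are hypothetical.

```python
from itertools import product

def xor_mode(d0, d1):
    """Single-gate mode signal: 1 for odd mode, 0 for even mode."""
    return d0 ^ d1

def reference_mode(d0_now, d0_next, d1_now, d1_next):
    """'Even' or 'Odd' from transition directions; None if not both transition."""
    t0, t1 = d0_next - d0_now, d1_next - d1_now
    if t0 == 0 or t1 == 0:
        return None
    return "Even" if t0 == t1 else "Odd"

mismatches = []
for bits in product((0, 1), repeat=4):
    mode = reference_mode(*bits)
    if mode is None:
        continue  # don't-care: the XOR output is not meaningful here
    if xor_mode(bits[1], bits[3]) != (mode == "Odd"):
        mismatches.append(bits)

print("mismatches:", mismatches)
```

The XOR output disagrees with the reference only in the don't-care cases, where the phase detector output is zero on average and the mode signal therefore does not contribute to the error signal.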

A timing diagram of the critical waveforms is shown in Figure 28. As indicated on Figure 27, S1' and S2' are the two data signals that exhibit crosstalk. As the even-mode transitions result in an early edge, whereas the odd-mode transitions result in a late edge, the crosstalk must be dominated primarily by capacitive coupling, and thus the crosstalk replica with weight  $\alpha$  should be positive.



Figure 27. Full top level diagram of adaptive XTC as shown in [16]



Figure 28. Timing diagram of Figure 27

With  $V_{phase}$  and the resulting  $V_{Mode}$  applied to the mixer inputs, the resulting error signal  $V_{\epsilon}$  is on average positive. Integrating this error signal increases the crosstalk cancellation control voltage  $V_{XTC}$ , which is the correct polarity desired. To further boost the loop gain,  $V_{phase}$  can be made up of both CDR phase detector outputs. This ensures that even when one signal is experiencing low transition density, the overall phase information used for adaptation does not drift needlessly.
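The convergence behavior can be illustrated with a normalized behavioral model. This is a minimal sketch and not the circuit of Figure 27: the analog integrator is replaced by a fixed-step (sign-based) update, the phase detector is reduced to a sign, and the names `k`, `alpha`, and `mu` are hypothetical. The error polarity follows Tables 4 and 5, and the integrator sign is an assumption chosen so that the loop has negative feedback.

```python
import random

def sign(x):
    return (x > 0) - (x < 0)

def adapt(k, n_bits=20000, mu=1e-3, seed=0):
    """Return the cancellation weight alpha after n_bits of random data.

    k is the signed coupling weight (positive: capacitive-dominated,
    negative: inductive-dominated); the loop is not told its sign.
    """
    rng = random.Random(seed)
    alpha = 0.0
    v_prev, a_prev = rng.getrandbits(1), rng.getrandbits(1)
    for _ in range(n_bits):
        v, a = rng.getrandbits(1), rng.getrandbits(1)
        if v != v_prev and a != a_prev:
            mode = 1 if v == a else -1  # even: +1, odd: -1
        else:
            mode = 0                    # superposition or no transition
        residual = k - alpha
        # Tables 1 and 2: capacitive residual makes even-mode edges early (-1).
        phase = -mode * sign(residual)
        error = mode * phase            # mixer output
        alpha -= mu * error             # integrator (polarity assumed)
        v_prev, a_prev = v, a
    return alpha

print(round(adapt(0.3), 2), round(adapt(-0.2), 2))
```

In this model the weight settles near `k` for either polarity of coupling, which is the property the blind scheme relies on.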

To further relax the speed requirements for the adaptive path, a sub-rate alternative to the full-rate adaptive scheme proposed in [16] is also implemented. As shown in Figure 29, the demultiplexed outputs of the retimers reflect the same bits as the full-rate data, and mode detection performed on the sub-rate output should also function properly for the adaptation loop. At the same time, phase detection for sub-rate architectures also relies on the same sub-rate samples; thus the timing differences between the mode and phase signals are much more easily controlled. However, due to this change in architecture, there will be discrepancies between the error signal generated for adaptation using the half-rate approach and that of the full-rate approach. Figure 30 shows two signals with all even, odd, and superposition mode sequence permutations to identify which combinations may deviate from the original adaptation scheme. The only instance where the error signals of the full-rate and half-rate approaches differ occurs during the superposition mode when no transitions take place and the polarities are identical. Fortunately, in a real-world application with random data, this single scenario does not prevent the adaptation from converging correctly.

Figure 29. Comparison between full-rate and half-rate mode signals

While prior works such as [13] have used CDR phase detection to perform adaptive crosstalk cancellation in the past, knowledge of the channel coupling characteristics was needed. The proposed topology not only determines whether the coupling is dominated by capacitive or inductive coupling with a single gate, but also supports sub-rate architectures without the need for additional digital circuits. The total hardware overhead for this adaptation method is only two CML XOR gates and an integrator.

Figure 30. Comparison of full-rate and half-rate adaptation error signals

Because the crosstalk cancellation adaptation scheme relies on correct operation of the CDR phase detector, it is instructive to determine which system-level parameters may cause issues. Through trial and error in the system-level simulation, it was found that the adaptation time must be significantly longer than the CDR locking time. Otherwise, the CDR may fail to properly perform the phase detection needed to determine the optimal reference point. In this section, the adaptation time is derived and then compared to the system-level simulation to verify its accuracy.

The feedforward path of the adaptation scheme includes only the crosstalk cancellation circuit, while the feedback path includes the CDR phase detector, mixers, and analog integrator. For the simplest case, a linear phase detector with gain $K_{PD}$ (in units of V/ps) is considered together with an integrator with transfer function $H(s)$. The mixers are assumed to have unity gain, and the crosstalk cancellation circuit is assumed to have gain $K_{XTC}$ (in units of ps/V). One important distinction to make here is that while the value of $K_{XTC}$ is defined as the amount of peak-to-peak jitter reduced per change in control voltage, the actual observable value of $K_{XTC}$ for the purposes of adaptation is cut in half. This is because the transitions impacted by crosstalk due to even- and odd-mode excitations only deviate from the reference superposition mode transitions by half the peak-to-peak value. Because the crosstalk adaptation only functions when even- and odd-mode excitations occur, the entire loop gain is modified by a scalar $\beta$ equal to the fraction of even- and odd-mode transitions out of all transitions. For example, for two signals PRBS7 and PRBS9, there are in total $127 \times 511 = 64897$ possible permutations for unique transitions. Of these, only 37.9% will be even- or odd-mode transitions. The system-level block diagram used for calculation of the adaptation time is shown in Figure 31.



Figure 31. Adaptation time system level block diagram

Using this figure, the transfer function can be derived as

$$\frac{t_{XTC}}{t_{in}} = \frac{0.5K_{PD}K_{XTC}H(s)}{1 + 0.5\beta K_{PD}K_{XTC}H(s)}$$
(1)

In practice, integrators are realized as high-gain single-pole op-amps with transfer function

$$H(s) = A_v \frac{1}{1 + sRC} = A_v \frac{1}{1 + s\tau}$$
(2)

Combining (1) and (2) gives the following expression:

$$\begin{aligned}\frac{t_{XTC}}{t_{in}} &= \frac{0.5K_{PD}K_{XTC}A_v \frac{1}{1+s\tau}}{1+0.5\beta K_{PD}K_{XTC}A_v \frac{1}{1+s\tau}} = \frac{0.5K_{PD}K_{XTC}A_v}{1+s\tau+0.5\beta K_{PD}K_{XTC}A_v} \\ &= \frac{\frac{0.5K_{PD}K_{XTC}A_v}{1+0.5\beta K_{PD}K_{XTC}A_v}}{1+s\frac{\tau}{1+0.5\beta K_{PD}K_{XTC}A_v}}\end{aligned}$$
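Reading the final single-pole form directly, and assuming that the adaptation time is characterized by the closed-loop time constant (the symbol $\tau_{adapt}$ is introduced here only for illustration), the pole location gives

$$\tau_{adapt} = \frac{\tau}{1+0.5\beta K_{PD}K_{XTC}A_v} \approx \frac{2\tau}{\beta K_{PD}K_{XTC}A_v} \quad \text{for } 0.5\beta K_{PD}K_{XTC}A_v \gg 1$$

Under this reading, the adaptation time scales inversely with $\beta$ when the loop gain is large, so patterns with fewer even- and odd-mode transitions would be expected to adapt more slowly.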

To verify this result, different pattern combinations ranging from '10101010' repeating to PRBS31 were used on both signal paths. In each behavioral model simulation, the crosstalk cancellation loop was enabled only after the CDR had sufficient time to lock and reach steady state. While the adaptation scheme still converges on the correct value even while the CDR is still locking, the CDR is made to lock first to fix the clock phase in order to avoid any confounding factors when analyzing only the adaptation loop. The results of these simulations are shown in Table 7. For these simulations, only the phase information from the phase detector for Pattern 1 was used for the adaptation scheme.

While the simulated results where Pattern 1 is made up of '10101010' repeating or PRBS9 are very close to the theoretical value, the difference noticeably increases as the number of consecutive identical digits (CID) in the pattern increases. To verify this result, the two patterns were interchanged so that PRBS9, rather than PRBS15, PRBS23, or PRBS31, was the pattern used for phase detection, which reduced the error. In order to understand why this occurs, the CDR phase detector was then fed with a clock signal fixed in phase at the center of the data eye. Under these circumstances, where the CDR is no longer continuously updating the clock phase, the difference between the theoretical and simulated results drops considerably. Because of the increased CID for patterns such as PRBS31, the VCO output will drift during long runs of 1s or 0s, which will shift the reference point for phase detection, and thus ultimately impact the crosstalk cancellation adaptation time.

| Pattern 1 | Pattern 2 | Simultaneous<br>Transition<br>Density | Simulated to<br>Theoretical<br>Adaptation Time<br>Difference |
|-----------|-----------|---------------------------------------|--------------------------------------------------------------|
| 10101010  | 11001100  | 0.5                                   | 4.4%                                                         |
| 10101010  | PRBS7     | 0.50394                               | 4.4%                                                         |
| 10101010  | PRBS9     | 0.50098                               | 4.7%                                                         |
| PRBS9     | PRBS7     | 0.2525                                | 7.0%                                                         |
| PRBS9     | PRBS15    | 0.2507                                | 6.9%                                                         |
| PRBS9     | PRBS23    | 0.25                                  | 6.2%                                                         |
| PRBS9     | PRBS31    | 0.25                                  | 6.2%                                                         |
| PRBS15    | PRBS9     | 0.2507                                | 10.4%                                                        |
| PRBS23    | PRBS9     | 0.25                                  | 13.1%                                                        |
| PRBS31    | PRBS9     | 0.25                                  | 17.4%                                                        |

Table 7. Difference between simulated and theoretical adaptation times
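The values in the "Simultaneous Transition Density" column appear consistent with the fraction of bit periods in which both patterns transition, evaluated over their common period. The sketch below reproduces three of the entries under that assumption. The LFSR tap choices are the commonly used PRBS7 and PRBS9 polynomials and are assumed here; the source does not state which generators were used.

```python
def prbs(order, tap):
    """One period of a maximal-length LFSR sequence with feedback from two stages."""
    mask = (1 << order) - 1
    state = mask  # any nonzero seed works
    bits = []
    for _ in range(mask):
        new = ((state >> (order - 1)) ^ (state >> (tap - 1))) & 1
        bits.append(new)
        state = ((state << 1) | new) & mask
    return bits

def transitions(bits):
    """Cyclic transition flags: 1 where a bit differs from the previous bit."""
    return [int(bits[i] != bits[i - 1]) for i in range(len(bits))]

def simultaneous_density(bits_a, bits_b):
    """Fraction of bit periods in which both patterns transition."""
    ta, tb = transitions(bits_a), transitions(bits_b)
    n = len(ta) * len(tb)  # common period when the two lengths are coprime
    both = sum(ta[i % len(ta)] & tb[i % len(tb)] for i in range(n))
    return both / n

prbs7 = prbs(7, 6)  # stages 7 and 6 (assumed generator)
prbs9 = prbs(9, 5)  # stages 9 and 5 (assumed generator)
clock = [1, 0]      # '10101010' repeating

print(round(simultaneous_density(clock, prbs7), 5))
print(round(simultaneous_density(clock, prbs9), 5))
print(round(simultaneous_density(prbs9, prbs7), 4))
```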

Through both system- and circuit-level simulation, it was found that the adaptation loop can function correctly as long as the clock frequency is correct. That is, the CDR's phase detection functionality does not need to be locked to the center of the eye for proper convergence. This is an extremely important asset because, before adaptation occurs, the data eye may often be closed sufficiently to prevent the CDR from phase-locking. As the system-level simulation has shown the adaptive loop to be quite robust, the circuit realization of this architecture will be explained in depth in the next chapter, along with circuit-level simulation results.

## **Chapter 5 Circuit Design, Layout, and Simulation Results**

In this chapter, the circuit design methodology and underlying concepts will be discussed. Because the designed receiver operates at very high data rates (50-Gbps/lane), many different topologies and their tradeoffs were considered. The chip was designed and fabricated using the SBC18H3 SiGe BiCMOS process by TowerJazz. While the npn bipolar transistor in this process has $f_T$ in excess of 200 GHz, the loading effects of each stage must be carefully considered. Therefore, the blocks were designed in the reverse order of the receiver data path: starting at the output driver, followed by the CDR, the crosstalk cancellation circuit, and finally the equalizer.

### **Issues Encountered with Low-Voltage BiCMOS Design**

As the chip is designed with a nominal supply voltage of 2.0V in mind, stacking multiple bipolar and MOS transistors in series is difficult due to headroom limitations. With a simulated $V_{BE}$ around 900mV for optimal current density, emitter followers inserted in buffer chains, as illustrated in Figure 32, are not a viable option. For the case where no current is conducted through $Q_1$ and $V_1 = V_{CC}$, then $V_2 = V_{CC}-V_{BE}$. Likewise, when all the current is conducted through $Q_1$, then $V_1 = V_{CC} - I_{buf}R_L$ and $V_2 = V_{CC} - I_{buf}R_L - V_{BE}$. For full switching to occur at the input of each CML buffer, a swing in excess of 180mV must be achieved; thus, for a supply of 2V and $V_{BE} \approx$ 900mV, $V_2$ will vary between 0.92V and 1.1V. While the emitter follower biasing transistors can remain in the forward-active region even with these collector voltages, any supply variation can easily drive them into saturation. More importantly, the emitter-coupled node of the second differential pair will then vary from 20mV to 200mV. Implementing a MOS current source is also problematic: a high overdrive voltage is required for robust operation, while a small W/L is required to limit the capacitance seen at the emitter-coupled node, and neither criterion can be met given the low headroom.
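The headroom arithmetic in this paragraph can be restated as a short calculation. The function and variable names are hypothetical; the numeric values are those quoted in the text, with the swing set to the 180mV minimum.

```python
V_CC = 2.0      # nominal supply (V)
V_BE = 0.9      # simulated base-emitter voltage (V)
V_SWING = 0.18  # minimum I_buf * R_L for full switching (V)

def follower_levels(v_cc, v_be, v_swing):
    """Low/high levels at V1, at V2 (after the emitter follower), and at the
    emitter-coupled node of the following differential pair."""
    v1 = (v_cc - v_swing, v_cc)
    v2 = tuple(v - v_be for v in v1)      # one V_BE drop through the follower
    v_tail = tuple(v - v_be for v in v2)  # one more V_BE drop into the next pair
    return v1, v2, v_tail

v1, v2, v_tail = follower_levels(V_CC, V_BE, V_SWING)
print("V2 range (V):", [round(v, 2) for v in v2])
print("Emitter-coupled node range (V):", [round(v, 2) for v in v_tail])
```

Only 20mV to 200mV remains across the tail current source, which is the basis for ruling out emitter followers in the buffer chain.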



Figure 32. Typical buffer chain in BiCMOS process



Figure 33. Cascaded CML buffers with no emitter-follower

Since emitter followers cannot be used, cascaded CML buffers were chosen, as shown in Figure 33. As shown earlier, $V_1$ (and the base of $Q_1$) will vary between $V_{CC}$ and $V_{CC}$-$V_{swing}$, as will $V_2$; thus, $Q_2$ will inevitably run into saturation. Both the gain variation in $Q_1$ and the change in input impedance of $Q_2$ result in noticeable distortion at the output. In order to mitigate this effect, the swing for each stage must be kept to a minimum while ensuring that full switching occurs; for this design, the swing for all blocks in the analog front-end is kept to within 200mV to account for some PVT variation. Additionally, while AC simulation of minimum-length npn transistors indicates $f_T$ above 200 GHz, the effect of input impedance variation on the data signal is smaller with larger device sizes.

## **Output Drivers**

With the full-rate data at 50-Gbps, proper characterization of the channel is required. The expected channel is a differential 1-inch microstrip line on Rogers 4350B material, as shown in Figure 34(a), with its corresponding S-parameters shown in Figure 34(b). There is also a gold bondwire with an expected inductance of around 150pH. Based on simulations in both the HFSS and ADS electromagnetic simulation tools, the expected insertion loss $S_{dd21}$ at 12.5GHz and 25GHz is -0.45dB and -0.84dB, respectively. These correspond to the frequencies of interest for the 25Gbps and 50Gbps data that will be driven at the output for testing purposes. Because the test port of the chip is expected to output 50Gbps data with crosstalk present, the actual bandwidth of the output driver is required to exceed 25GHz. Using these S-parameters in Cadence Virtuoso, an output stage with an emitter length of 2.64µm and a bias current of 5.6mA is required to drive 50Gbps data with a predetermined "worst-case" amount of crosstalk. With shunt-peaking, the driver together with the channel can actually achieve a maximally flat response with a bandwidth of 65GHz.




Figure 34. (a) 1" Differential Channel Modeled in HFSS; (b) Corresponding S-parameters

One other critical issue for the output driver design is routing. As the chip is quite large, it is difficult to place the output driver close to where the optimal probing point would be. For this chip, the 50Gbps output driver was located $360\mu m$ away from the crosstalk cancellation circuit output. In order to drive such a long interconnect, a high-speed pre-driver with small input capacitance is preferred so as not to needlessly load the crosstalk cancellation output. A pair of emitter followers is used to fulfill these criteria, together with a high-pass filter to shift the biasing level to $V_B$, which is generated using a reference current. The high-pass filter is made up of a Metal-Insulator-Metal (MIM) capacitor and a high-density poly resistor. The values of $R_B$ and $C_B$ are determined to account for a random signal with 200 consecutive identical digits. An additional shunt-peaked ECL buffer is used to restore the signal amplitude to ensure full current-switching occurs for the output stage. The schematic for the entire 50-Gbps output driver is shown in Figure 35.
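One way to relate the run-length requirement to the filter values is through the droop of a first-order high-pass response during a long run of identical bits. The sketch below is an illustration under stated assumptions: the 5% droop budget is hypothetical, the function name is hypothetical, and the actual sizing criterion used for $R_B$ and $C_B$ is not given in the text.

```python
import math

def min_time_constant(n_cid, bit_rate, max_droop):
    """Smallest R_B*C_B (s) keeping first-order high-pass droop below max_droop.

    During a run of identical digits the coupled level decays as exp(-t / RC),
    so the droop after the run is 1 - exp(-run_length / RC).
    """
    run_length = n_cid / bit_rate
    return -run_length / math.log(1.0 - max_droop)

# 200 CID at 50 Gbps is a 4 ns run; the 5% droop budget is a hypothetical choice.
tau = min_time_constant(n_cid=200, bit_rate=50e9, max_droop=0.05)
corner_hz = 1.0 / (2.0 * math.pi * tau)
print(f"R_B*C_B >= {tau * 1e9:.1f} ns, corner frequency <= {corner_hz / 1e6:.2f} MHz")
```

A tighter droop budget or a longer run length increases the required $R_BC_B$ product proportionally for small droop values.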



Figure 35. Schematic for 50Gbps output driver stage

For the remaining two output drivers, one driving the half-rate 25-Gbps data from the CDR, and the other a sub-rate 3.125 GHz clock, the emitter follower and high-pass filter are not needed since a single CML buffer would suffice in each path despite the similar interconnect routing length. The layout for the output driver and pre-driver is shown in Figure 36.



Figure 36. Layout for predriver and output driver

### **Clock & Data Recovery Circuit**

## Voltage-Controlled Oscillator (VCO) Design

For an operating speed of 50 Gbps, a half-rate binary CDR is adopted due to speed and power considerations. In this section, the design of each block will be discussed, along with any tradeoffs and choices made for high-speed sections.

Using high-speed npn transistors, 50 GHz clock generation is possible; however, noise simulation shows that the device noise of npn transistors in this process is significantly higher (by more than 15dB) than that of nmos transistors. In addition to higher noise, the npn transistor swing is severely limited. In Figure 37, the nmos and npn transistors are driven with the same tail current and have comparable input capacitances. While the npn cross-coupled pair exhibits a much greater negative-$g_m$, the usable range for a negative-$g_m$ oscillator is significantly smaller than that for an nmos cross-coupled pair.



Figure 37. Transconductance

This can be remedied using level-shifting as shown in Figure 38. In Figure 38(a), the single-ended output voltages of $V_2^+$ and $V_2^-$ are centered around $V_{CC}$, but the ac-coupling from $C_B$ and $R_B$ biases the inputs of the npn transistors at $V_B$; by keeping this voltage quite a bit lower than $V_{CC}$, the output swing can be maximized. The primary drawback of this approach is the large backplate capacitance seen from $C_B$ and the noise generated from realizing a large resistance $R_B$. For high frequencies, both $C_B$ and $R_B$ can be scaled down, but will still load the resonant tank and decrease Q. The other level-shifting approach, shown in Figure 38(b), uses emitter followers. The addition of $R_{CC}$ is required to ensure that the emitter followers do not go into saturation. With $V_2^+$ and $V_2^-$ centered at $V_{CC}-I_{VCO}R_{CC}$, the bases of the VCO transistors, $V_1^+$ and $V_1^-$, would be centered about $V_{CC}-I_{VCO}R_{CC}-V_{BE}$. The primary advantage of this topology is the integration of high-speed output buffers; additionally, the emitter followers exhibit inductive components in the input impedance, which may actually help oscillation. However, the additional phase shift introduced by the emitter followers in the feedback path limits the upper range of frequencies available. Also, since the emitter follower output drives both the high-speed sampling circuits and the inputs of the VCO transistors, the VCO becomes very susceptible to any broadband kickback from the sampling circuits.



Figure 38. Level shifting in VCOs using (a) high-pass filtering and (b) emitter followers

Given the issues involved with both the noise performance and limited swing when using npn transistors, it would be best if a high-speed CMOS VCO could be used instead. As the nMOS parasitic capacitances for a 180-nm process are quite high, design of a CMOS VCO at 50 GHz while maintaining an acceptable tuning range of $\pm 5\%$ is difficult. In fact, circuit simulation shows that, in order for oscillation to occur, the nMOS transistors would need to be scaled up and even then could only reach an oscillation frequency of 50 GHz in the absence of any tuning elements. With varactors added in the tank, the Q would drop significantly and require even larger transistors to boost the loop gain of the oscillator, resulting in larger parasitic capacitances and lowering the oscillation frequency. Therefore, a half-rate architecture using a 25-GHz VCO was chosen instead.

Before continuing design of the VCO, it is very important to consider what kind of clock is required for the half-rate CDR. In particular, the main differences come from whether the CDR phase detector is linear or binary, as this choice not only impacts whether or not quadrature clock generation will be required, but also determines the total capacitive loading seen for the clock driver.

While a half-rate linear phase detector like the one in [17] is composed of only 4 latches, the front-end latches must be capable of outputting signals on the order of 2 times the data rate. For this chip, that would necessitate a latch capable of outputting data at 100 Gb/s and would also put strict rise-fall time restrictions on the clock signal. Through circuit simulation, a single CML latch capable of this would require a 12mA bias current.

A half-rate binary phase detector, on the other hand, requires six latches (three D flip-flops) but has relatively lower speed requirements. The front-end latches only need to sample the full-rate data and can tolerate outputting data at half-rate since the following latch in series will resample at half-rate as well. Through circuit simulation, this latch only requires about 3mA to meet the requisite performance. However, half-rate binary phase detectors require quadrature clock generation, which will substantially increase the power and complexity of the VCO and clock distribution network. Based on post-layout simulations, a half-rate binary phase detector together with quadrature clock generation draws about 30mA, whereas a half-rate linear phase detector with standard clock generation draws 38mA. Thus, a quadrature VCO (QVCO) is chosen for this design.



Figure 39. Initial QVCO schematic

The circuit shown in Figure 39 is initially used to realize a QVCO running at 25GHz. The cross-coupled pair transistors are sized at 32 $\mu$m/0.18 $\mu$m but are drawn as 16 x 2 $\mu$m/0.18 $\mu$m devices in parallel in order to reduce the gate poly resistance and thereby decrease loss in the VCO. Simulation shows that the transistors used for quadrature coupling must be about one-quarter of this size to properly force the phases into quadrature. Since these are 0.18 $\mu$m devices, the parasitic capacitances are non-trivial for 25GHz clock generation, especially with an additional 25% capacitance present for quadrature phase generation. Initially, adding an inductor at the source-coupled node for each VCO as seen in [18] was considered, as the coupling between the two inductors will reinforce the coupling between the two VCOs. This would allow transistors M5-M8 to be sized smaller at 2 x 2 $\mu$m/0.18 $\mu$m while still maintaining quadrature phases. However, the additional sidewall capacitance from routing the top-level metals turned out to increase the overall capacitance in the tank, lowering the oscillation frequency. Instead, the two VCO cores share a current source as shown in Figure 40. The shared current source allows higher peak current for each individual VCO. Additionally, although the drain of M9 sees a large capacitance, a small portion of the current from each VCO flows into the other branch, slightly reinforcing the quadrature phase-locking.



Figure 40. Final QVCO schematic

In order to tune the VCO, variable capacitance is often implemented as a switched-capacitor bank together with varactors. As CMOS technologies become more advanced, the "on-resistance" $r_{on}$ of a MOS transistor acting as a switch becomes smaller for similar parasitic capacitance values. Unfortunately, in a 180nm process, a wider transistor is needed in order to achieve comparable $r_{on}$ values; however, wide transistors also have larger parasitic capacitances. This creates a strict tradeoff between decreasing $r_{on}$ and decreasing the capacitance. Figure 41 shows typical implementations of a capacitor bank for (a) single-ended outputs and (b) differential outputs. The shared switch in Figure 41(b) has the benefit of reducing the differential-mode series resistance seen across the effective capacitor. To switch bands at these frequencies, the discrete capacitance values have to be very small; therefore, both the MIM capacitors and the nMOS transistors need to be minimized in size. For a tunable capacitor with a tuning range of 30fF, the quality factor Q is 4.5 when the nMOS switch is off but drops to only 0.7 when the switch is on. As the overall Q of the tank is given by $\left(\frac{1}{Q_L} + \frac{1}{Q_C}\right)^{-1}$, the low Q of the capacitor bank greatly reduces the loop gain of the VCO, making oscillation very difficult. In post-layout simulation, using this type of capacitor bank required 22mA total to ensure the QVCO oscillated across all temperatures of interest, 0 to 120°C.

Figure 41. Standard capacitor banks: (a) single-ended and (b) differential

The discrete tuning circuit used in this chip borrows from the idea that reverse-biased diodes are often used as varactors for very high-speed oscillators. In this case, these varactors are controlled by connecting each emitter to either ground or $V_{DD}$, as shown in Figure 42. Because capacitances for discrete diodes in this design kit were too large, npn transistors were used instead. The base and collector are both shorted to ground, and the base-emitter junction is chosen as the varactor because the capacitance seen across the base and collector is too small to offer adequate tuning range. Compared with the nMOS switch variant shown in Figure 41(b), the Q of a tunable capacitor with the same tuning range of 30fF varies from 6.5 to 8.3, relaxing the power requirements for the VCO to begin oscillation. In fact, because the Q of the tank is much higher than before, the transistor sizes can be reduced by 20% to 1.6 $\mu$m-long fingers, which in turn allows for more flexibility with the tuning range. In the post-layout simulation of the final VCO design, the QVCO alone draws only 6mA while also having a tuning range of 10%.
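To make the impact of the capacitor-bank Q concrete, the tank-Q expression quoted above can be evaluated for the two bank styles. This is only a rough sketch: the inductor Q (`Q_L_ASSUMED`) is not given in the text and is a placeholder, and using the bank Q directly as $Q_C$ ignores the fixed capacitance in the tank.

```python
def tank_q(q_inductor: float, q_capacitor: float) -> float:
    """Combined tank Q from the text: (1/Q_L + 1/Q_C)^-1."""
    return 1.0 / (1.0 / q_inductor + 1.0 / q_capacitor)


# Assumption: the inductor Q is not stated in the text; 15 is a placeholder.
Q_L_ASSUMED = 15.0

# Capacitor-bank Q values quoted in the text for a 30 fF tunable capacitor.
BANK_Q = {
    "nMOS switch, off": 4.5,
    "nMOS switch, on": 0.7,
    "npn junction, low end": 6.5,
    "npn junction, high end": 8.3,
}


def tank_q_table(q_inductor: float = Q_L_ASSUMED) -> dict:
    """Tank Q for every bank setting, for a given inductor Q."""
    return {name: tank_q(q_inductor, q_c) for name, q_c in BANK_Q.items()}


if __name__ == "__main__":
    for name, q in tank_q_table().items():
        print(f"{name:24s} Q_tank = {q:.2f}")
```

Whatever inductor Q is assumed, the combined Q can never exceed the smaller of the two terms, which is why the switched-on nMOS bank dominates the loss.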



Figure 42. Capacitor bank realized using npn transistors instead of MOS switches

The candidates for the VCO clock driver were conventional LC-tuned buffers or emitter followers. Initially, a tuned buffer with MOS inputs was preferred due to its wider input range; however, the transconductance of the nMOS transistors was not enough to maintain a good output swing without either increasing the power substantially or increasing the transistor size, which would decrease both the VCO oscillation frequency and the tuning range. While swapping in npn transistors would boost the gain of the clock driver, the finite input resistance at the base of each transistor lowers the Q of the tank substantially and impacts the VCO oscillation startup requirement. Additionally, as discussed later in the phase detector design, the clock driver output swing needs to be somewhat controlled to ensure optimal performance for the D flip-flops. Finally, additional capacitive tuning is required to account for process variations and modeling inaccuracies, which further lowers the Q of the clock driver and increases the power consumption required to achieve the same swing. Due to these factors, emitter followers were chosen as clock drivers, with a simplified schematic shown in Figure 43(a). Assuming both $C_{Bias}$ and $R_{Bias}$ are sufficiently large, and assuming $r_o \rightarrow \infty$, the equivalent circuit is shown in Figure 43(b). $R_E$ and $C_E$ are the lumped resistance and capacitance seen at the emitter follower output, respectively. From the small-signal equivalent circuit, the real part of the input admittance of the emitter follower is given by:

$$Re\{Y_{in}\} = \frac{1 + \omega^2 \frac{r_{\pi}R_E [r_{\pi}C_{\pi}^2 + R_E C_E (C_E - \beta C_{\pi})]}{(\beta + 1)R_E + r_{\pi}}}{1 + \omega^2 \left[\frac{r_{\pi}R_E (C_{\pi} + C_E)}{(\beta + 1)R_E + r_{\pi}}\right]^2}$$

At high frequencies, the condition to achieve a negative real part in the admittance is given by:

$$r_{\pi}C_{\pi}^{2} + R_{E}C_{E}^{2} - \beta R_{E}C_{E}C_{\pi} < 0$$

where $R_E$ is the parallel combination of the emitter follower output resistance $r_o$ and the input resistances of the clocking transistors in the phase detector. Although the biasing points are different, the input resistance of each clocking transistor can be approximated as a fixed value of $r_{\pi}$ and, as discussed later, each emitter follower drives 4 npn transistors. Therefore, $R_E = \frac{r_{\pi}}{4} || r_o \approx \frac{r_{\pi}}{4}$.

The load capacitance $C_E$ is made up of the input capacitance $C_{\pi}$ of each clocking transistor and the parasitic capacitance of the routing, $C_{parasitic}$, so that $C_E = 4C_{\pi} + C_{parasitic}$, giving the following condition:

$$r_{\pi}C_{\pi}^{2} + \frac{r_{\pi}}{4} \left( 4C_{\pi} + C_{parasitic} \right)^{2} - \beta \frac{r_{\pi}}{4} \left( 4C_{\pi} + C_{parasitic} \right) C_{\pi} < 0$$
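One way to see what this condition implies for the parasitic capacitance is to normalize it by $C_{\pi}$. The following is a sketch of the algebra, where $x = C_E/C_{\pi}$ is shorthand introduced here and is not a symbol from the original derivation. Multiplying the inequality by $4/(r_{\pi}C_{\pi}^{2})$ gives

$$\begin{aligned} x^{2} - \beta x + 4 &< 0, \qquad x \equiv \frac{C_E}{C_{\pi}} = 4 + \frac{C_{parasitic}}{C_{\pi}} \\ \Rightarrow \quad \frac{\beta - \sqrt{\beta^{2} - 16}}{2} &< x < \frac{\beta + \sqrt{\beta^{2} - 16}}{2} \end{aligned}$$

Since $x \ge 4$ by construction and the lower root stays below 4 for any $\beta \ge 4$, only the upper root constrains the design, which leads to an upper bound on $C_{parasitic}/C_{\pi}$.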



Figure 43. Emitter-follower with loading (a) schematic and (b) equivalent circuit

This relationship will hold true as long as $C_{parasitic} < \left(\frac{\beta + \sqrt{\beta^2 - 16}}{2} - 4\right) C_{\pi}$. With $\beta \gg 4$, this bound is easily met, so the real part of the emitter follower's input admittance is negative at high frequencies, which helps the VCO oscillation startup conditions. For this process, the nominal value of $\beta$ is 100 and, from post-layout simulation, $\frac{C_{parasitic}}{C_{\pi}} \approx 3$ for this design.
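As a quick numerical check of the bound above, the following sketch evaluates it for the nominal $\beta$ and the quoted capacitance ratio. The function names are illustrative only, and the expressions assume $R_E = r_{\pi}/4$ and $C_E = 4C_{\pi} + C_{parasitic}$ as in the text.

```python
import math


def cpar_ratio_upper_bound(beta: float) -> float:
    """Upper bound on C_parasitic / C_pi from the text:
    (beta + sqrt(beta^2 - 16)) / 2 - 4."""
    if beta < 4:
        raise ValueError("beta must be at least 4 for a real bound")
    return (beta + math.sqrt(beta**2 - 16.0)) / 2.0 - 4.0


def condition_lhs(beta: float, cpar_over_cpi: float) -> float:
    """Left-hand side of the inequality, normalised by r_pi * C_pi^2.

    A negative value means the high-frequency real part of the
    emitter-follower input admittance is negative.
    """
    x = 4.0 + cpar_over_cpi  # C_E / C_pi
    return 1.0 + x**2 / 4.0 - beta * x / 4.0


if __name__ == "__main__":
    beta, ratio = 100.0, 3.0  # nominal values quoted in the text
    print(f"bound on C_parasitic/C_pi: {cpar_ratio_upper_bound(beta):.2f}")
    print(f"condition met: {condition_lhs(beta, ratio) < 0}")
```

For the quoted values the design sits far inside the allowed region, so the conclusion is not sensitive to moderate errors in the extracted parasitic capacitance.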



Figure 44. 25GHz QVCO and clock driver layout

The final layout for the 25GHz QVCO with clock drivers is shown in Figure 44. The inductors chosen are 150 $\mu$ m x 150 $\mu$ m octagonal single-turn inductors drawn using the top-most metal layer with a thickness of 2 $\mu$ m. Single-turn inductors were chosen for this application due to their higher self-resonant frequency (SRF). The clock-drivers are placed to the right of the VCO core, and since emitter followers do not contain inductors, which would occupy a large area, the routing distance between the VCO output and clock driver input can be reduced by at least 40 $\mu$ m. As shown in Figure 45, the VCO achieves a tuning range of ±4.5%. The closed-loop CDR's VCO phase noise is also shown in Figure 46, with 18.91fs rms-jitter generation expected, integrating from 10kHz to 100MHz.



Figure 45. VCO frequency bands



Figure 46. Simulated phase noise of clock and divide-by-two output

#### **Phase Detector**

As shown in Figure 47, the half-rate binary phase detector is made up of 3 D Flip-Flops (DFF) and two XOR gates. The DFFs are each composed of two high-speed CML latches, and the XOR gates are based on a high-speed design from [19].



Figure 47. Half-rate binary phase detector



Figure 48. Standard high-speed latch schematic for BiCMOS process

Due to the high operating speed, the clock-to-Q delay of each latch must be minimized. Figure 48 shows a standard design for a high-speed latch in BiCMOS processes where a higher supply voltage is allowed. Due to the stacked transistors in this design, the minimum value of  $V_{CC}$  to ensure no npn transistor enters saturation and that no nMOS goes into triode would be

$$V_{CC} \ge \frac{I_{Latch}R_C}{2} + 2V_{BE} + CLK_{CM} + \frac{V_{CLK}}{2} - Bias_{Latch}$$

This value can easily exceed the 2V supply for this process technology. Initially, a DFF topology such as the one in Figure 49 was considered, where ac-couplers are used to level shift the data path. The value of $V_B$ would be chosen to be below $V_{CC} - \frac{3}{2}V_{swing}$ to ensure the npn transistors do not go deep into saturation. Unfortunately, not only does the biasing become more complicated, but this topology can also only handle data patterns with a limited number of consecutive identical digits (CID). Additionally, for the nMOS pair to fully switch, the single-ended swing must exceed 540mV, which is not trivial at these operating speeds.

Figure 49. D-Latch using AC coupling to level shift

Therefore, the circuit in Figure 50 was considered instead. Due to the high current required for high-speed operation, the current source MOS transistor $M_1$ would need to be sized considerably larger to avoid entering triode, making the capacitance seen at node $V_{CS}$ quite large. Since the clock frequency is 25GHz, the capacitance seen at $V_{CS}$ will behave as a short to ground during operation. Therefore, the pseudo-differential D-Latch shown in Figure 51 is used. Because the current flowing through the latch is not constant for pseudo-differential clocking, great care must be taken to avoid abnormally large currents that would force the data path transistors deep into saturation. Since the clock driver for the VCO is an emitter follower, the output swing of the clock buffer is constrained due to nonlinearities. As the VCO swing increases and brings the emitter follower's input transistor deep into saturation, the base-emitter voltage drop no longer scales linearly with the input voltage for a fixed current. The output swing of the clock buffer varies only between 190mV and 220mV across operating conditions. As a DFF is made up of two cascaded D-latches, the first latch is shunt-peaked to improve bandwidth in order to properly sample the 50Gbps data, while the second latch does not require additional bandwidth extension as its output only needs to be capable of sampling 25Gbps data.

Figure 50. CML D-Latch

Figure 51. Pseudo-differential DFF

While the speed requirement of the CML XOR gates is less critical than that of the DFF [20], the asymmetry in signal paths between signals A and B in Figure 52 causes problems for this design. First, the common-mode levels of A and B are different, requiring an additional level-shifter for signal B. Second, the difference in signal paths for A and B results in different delays, which may cause the XOR operation to malfunction at high speeds. Third, as explained previously for the D-Latch design, the limited voltage headroom makes design at low supply voltages difficult. Instead, the symmetric XOR gate in Figure 53 is used [19]. Not only are the common-mode level and propagation delay to the output identical for both inputs in this topology, but the headroom requirement is also no different from that of a CML buffer.



Figure 52. Conventional CML XOR



Figure 53. Symmetric CML XOR gate

Figure 54 shows the integrated phase detector output with respect to clock delay with a PRBS9 input data stream. Because the binary phase detector gain is susceptible to change based upon the jitter on the incoming signal, it is useful to properly gauge the range of phase detector gains to ensure the CDR phase margin is not compromised. The best-case scenario is meant to illustrate the phase detector at a temperature of 0°C with clean input data. On the other hand, the worst-case scenario represents the phase detector at a temperature of 120°C with a large amount of crosstalk still present on the data. Based on these plots, the phase detector gain can vary by a factor of 5.6 depending on the operating conditions.



*Figure 54. Simulated best- and worst-case phase detector output characteristic* 

Unlike full-rate binary phase detectors, half-rate binary detectors were found to have extremely limited frequency acquisition ranges. Therefore, a PLL is required to first frequency lock before transitioning over to the CDR to phase-lock to the incoming data stream. A standard dual-loop architecture is used for this design with a standard bang-bang phase-frequency detector. For this chip, a divide ratio of 64 was chosen with a reference clock frequency of 390.625MHz. Although neither the phase-noise performance nor the loop characteristics of the PLL are critical to the crosstalk cancellation loop performance, proper divider design is required to both lower the power consumption and ensure the divider functions correctly for the frequency ranges of interest. Due to speed considerations, the first three dividers after the VCO are realized using CML D Flip-Flops while the last three dividers are realized as TSPC latches, as shown in Figure 55. For the CML dividers, the tail current was decreased, and the load resistor increased as the divider frequency lowered. For the TSPC divide-by-two design, the lengths of the dividers were increased as the divider frequency lowered. As shown in Figure 56, the divider sensitivity curves show the divider's minimum sensitivity points are close to the desired output frequencies, ensuring the entire divider chain will work appropriately. In simulation, these curves were defined by placing a minimum SNR for the dividers. However, these results are pessimistic and in the lab measurement the divider chain does self-oscillate at the divide-by-8 test output at 3.8GHz.
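The frequencies seen along the divider chain follow directly from the numbers above. The sketch below assumes six cascaded divide-by-two stages (three CML followed by three TSPC), as described in the text; the function name is illustrative.

```python
def divider_chain(f_vco_hz: float, stages: int = 6) -> list:
    """Output frequency (Hz) after each cascaded divide-by-two stage."""
    freqs = []
    f = f_vco_hz
    for _ in range(stages):
        f /= 2.0
        freqs.append(f)
    return freqs


if __name__ == "__main__":
    f_ref = 390.625e6          # reference clock from the text
    divide_ratio = 64          # divide ratio from the text
    f_vco = f_ref * divide_ratio
    styles = ["CML"] * 3 + ["TSPC"] * 3
    print(f"VCO: {f_vco / 1e9:.3f} GHz")
    for k, (style, f) in enumerate(zip(styles, divider_chain(f_vco)), start=1):
        print(f"stage {k} ({style:4s}): {f / 1e9:.6f} GHz")
```

The third stage output is the divide-by-8 clock that is brought out to a test port, and the last stage output equals the reference frequency when the loop is locked.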



Figure 55. TSPC divide-by-two







Figure 56. CMU Divider sensitivity curves

#### **Crosstalk Cancellation Circuit**

The crosstalk cancellation (XTC) circuit is a simple current-mode summation that adds or subtracts the derivative of the aggressor from the victim signal. The original XTC circuit from [16] is shown in Figure 57. By steering current from the differentiator branch by adjusting the control voltage, the crosstalk cancellation coefficient can be either positive or negative. Unfortunately, the large capacitances from the nMOS transistors both at the output and the current sources limited the bandwidth and the crosstalk that could be cancelled, respectively, so the design was updated to the one shown in Figure 58. The currents flowing through each differentiator branch are directly adjusted by changing the biasing of the current sources. The total current flowing into the load resistor is kept constant just as in the original schematic, but the absence of nMOS transistors allows higher operating speeds. The lack of cascode transistors also eliminates the need for any level-shifting that was required previously as well.



Figure 57. Original current-mode summation for crosstalk cancellation



Figure 58. Updated current-mode summation for crosstalk cancellation

To understand how the XTC circuit functions, the output current can be derived using a small-signal model. With the victim signal being represented by  $V_1$  and the aggressor by  $V_2$ , the output current can be shown to be

$$I_{out} = V_1 \frac{\beta}{(1+\beta)R_E + r_{\pi 1}} \frac{1}{1 + s \frac{R_E r_{\pi 1} C_{\pi 1}}{(1+\beta)R_E + r_{\pi 1}}} + V_2 \frac{\beta}{1+\beta} s C_E \left[ \frac{1}{1 + s \frac{r_{\pi 2} (C_{\pi 2} + C_E)}{1+\beta}} - \frac{1}{1 + s \frac{r_{\pi 3} (C_{\pi 3} + C_E)}{1+\beta}} \right]$$

With  $r_{\pi} = \beta/g_m$  and  $g_m = I_C/V_T$ , the output current can be approximated as

$$\begin{split} I_{out} &\approx V_1 \frac{\beta}{(1+\beta)R_E + r_{\pi 1}} \frac{1}{1 + s \frac{R_E r_{\pi 1} C_{\pi 1}}{(1+\beta)R_E + r_{\pi 1}}} \\ &+ V_2 \frac{\beta}{1+\beta} s C_E \left[ \frac{1}{1 + s \frac{V_T (C_{\pi 2} + C_E)}{I_{C2}}} - \frac{1}{1 + s \frac{V_T (C_{\pi 3} + C_E)}{I_{C3}}} \right] \end{split}$$

As can be seen from the above expression, the poles of the differentiator branches can be adjusted through the relative values of $I_{C2}$ and $I_{C3}$. In practice, these values were determined through transient simulation rather than AC simulation due to the nonlinear behavior of the devices and the differentiator circuit in the presence of large signals.
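The dependence of the aggressor term on the two collector currents can be explored numerically. The sketch below evaluates only the $V_2$ term of the small-signal expression, with each branch time constant taken as $V_T(C_{\pi}+C_E)/I_C$, which follows from $r_{\pi} = \beta/g_m$ and $g_m = I_C/V_T$ for $\beta \gg 1$. All component values are placeholders, not values from the design, and $C_{\pi}$ is held fixed even though in a real device it varies with bias.

```python
import math

V_T = 0.02585  # thermal voltage near 300 K, in volts


def branch_tau(i_c: float, c_pi: float, c_e: float, v_t: float = V_T) -> float:
    """Time constant V_T * (C_pi + C_E) / I_C of one differentiator branch."""
    return v_t * (c_pi + c_e) / i_c


def aggressor_transadmittance(freq_hz: float, i_c2: float, i_c3: float,
                              c_pi: float = 30e-15, c_e: float = 20e-15,
                              beta: float = 100.0) -> complex:
    """Complex I_out / V_2 (siemens) for the aggressor term only.

    c_pi, c_e and beta defaults are assumed placeholder values.
    """
    s = 1j * 2.0 * math.pi * freq_hz
    h2 = 1.0 / (1.0 + s * branch_tau(i_c2, c_pi, c_e))
    h3 = 1.0 / (1.0 + s * branch_tau(i_c3, c_pi, c_e))
    return beta / (1.0 + beta) * s * c_e * (h2 - h3)


if __name__ == "__main__":
    i_total = 2e-3  # assumed total differentiator current, in amperes
    for split in (0.25, 0.5, 0.75):
        y = aggressor_transadmittance(25e9, split * i_total,
                                      (1.0 - split) * i_total)
        print(f"I_C2 share {split:.2f}: |Y| = {abs(y) * 1e6:8.2f} uS, "
              f"phase = {math.degrees(math.atan2(y.imag, y.real)):7.1f} deg")
```

With equal currents the two branches cancel, and swapping the currents flips the sign of the term, which is how the coefficient can take either polarity while the total current stays constant.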

## Equalizer

Figure 59. Simplified schematic of CTLE

The equalizer used for this design is a CTLE with a similar operation to the crosstalk cancellation circuit. Instead of the conventional CTLE with a source-degenerated resistor and capacitor as previously shown in Figure 9, the source-degeneration is split into two branches as shown in Figure 59. This topology is chosen to individually optimize the low-pass and high-pass characteristics of the CTLE. To further understand this point, ignoring the inductors in the load, the small-signal transfer function can be approximated as

$$\left|\frac{V_{out}}{V_{in}}\right| \approx \frac{\beta R_{C}}{(1+\beta)R_{E} + r_{\pi 1}} \cdot \frac{\left[1 + s \frac{r_{\pi 1}C_{\pi 1}R_{E}C_{E}}{r_{\pi 2}(C_{\pi 2} + C_{E}) + ((1+\beta)R_{E} + r_{\pi 1})C_{E}}\right] \left[1 + s \frac{r_{\pi 2}(C_{\pi 2} + C_{E}) + ((1+\beta)R_{E} + r_{\pi 1})C_{E}}{1+\beta}\right]}{\left[1 + s \frac{r_{\pi 1}C_{\pi 1}}{1+\beta}\right] \left[1 + s \frac{r_{\pi 2}(C_{\pi 2} + C_{E})}{1+\beta}\right] \left[1 + s R_{C}C_{C}\right]}$$

with poles and zeroes given by:

$$p_1 = \frac{1}{R_C C_C}, \quad p_2 = \frac{1+\beta}{r_{\pi 2}(C_{\pi 2}+C_E)}, \quad p_3 = \frac{1+\beta}{r_{\pi 1}C_{\pi 1}}$$

$$z_{1} = \frac{r_{\pi 2}(C_{\pi 2} + C_{E}) + ((1 + \beta)R_{E} + r_{\pi 1})C_{E}}{r_{\pi 1}C_{\pi 1}R_{E}C_{E}}, \quad z_{2} = \frac{1 + \beta}{r_{\pi 2}(C_{\pi 2} + C_{E}) + ((1 + \beta)R_{E} + r_{\pi 1})C_{E}}$$

For this design, $R_E(1+\beta)$ is about an order of magnitude greater than $r_{\pi}$, so assuming that $R_E(1+\beta) \gg r_{\pi}$,

$$z_1 \approx \frac{1+\beta}{r_{\pi 1}C_{\pi 1}}, \quad z_2 \approx \frac{1}{R_E C_E}$$

With $p_3$ and $z_1$ cancelling out, this circuit has one zero and two poles, with $z_2 < p_1 < p_2$. The value of $R_E$ is chosen such that the worst-case DC gain at the high-temperature corner will still be unity. $C_E$ is realized as a pair of varactors that are controlled either externally by manual control or by the integrated output of the op-amp in the adaptive equalization loop. Shunt-peaking is also used to further extend the bandwidth of the CTLE; at the minimum peaking setting, the CTLE has a simulated bandwidth of 39GHz.
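To get a feel for how good the approximations are when $R_E(1+\beta)$ is only about ten times $r_{\pi}$, the exact and approximate zero expressions can be compared numerically. The sketch below uses placeholder component values (none are taken from the design) chosen only so that $R_E(1+\beta) \approx 10\,r_{\pi}$.

```python
import math


def ctle_zeros(beta, r_e, c_e, r_pi1, c_pi1, r_pi2, c_pi2):
    """Exact (z1, z2) in rad/s from the expressions in the text."""
    k = r_pi2 * (c_pi2 + c_e) + ((1.0 + beta) * r_e + r_pi1) * c_e
    z1 = k / (r_pi1 * c_pi1 * r_e * c_e)
    z2 = (1.0 + beta) / k
    return z1, z2


def ctle_zeros_approx(beta, r_e, c_e, r_pi1, c_pi1):
    """Approximate (z1, z2) in rad/s, assuming (1 + beta) * R_E >> r_pi."""
    return (1.0 + beta) / (r_pi1 * c_pi1), 1.0 / (r_e * c_e)


def ctle_poles(beta, r_c, c_c, c_e, r_pi1, c_pi1, r_pi2, c_pi2):
    """(p1, p2, p3) in rad/s from the expressions in the text."""
    p1 = 1.0 / (r_c * c_c)
    p2 = (1.0 + beta) / (r_pi2 * (c_pi2 + c_e))
    p3 = (1.0 + beta) / (r_pi1 * c_pi1)
    return p1, p2, p3


# Placeholder values (assumptions, not design values).
ASSUMED = dict(beta=100.0, r_e=50.0, c_e=150e-15,
               r_pi1=500.0, c_pi1=50e-15, r_pi2=500.0, c_pi2=50e-15)
R_C_ASSUMED, C_C_ASSUMED = 50.0, 60e-15

if __name__ == "__main__":
    z1, z2 = ctle_zeros(**ASSUMED)
    z1a, z2a = ctle_zeros_approx(ASSUMED["beta"], ASSUMED["r_e"],
                                 ASSUMED["c_e"], ASSUMED["r_pi1"],
                                 ASSUMED["c_pi1"])
    p1, p2, p3 = ctle_poles(ASSUMED["beta"], R_C_ASSUMED, C_C_ASSUMED,
                            ASSUMED["c_e"], ASSUMED["r_pi1"],
                            ASSUMED["c_pi1"], ASSUMED["r_pi2"],
                            ASSUMED["c_pi2"])
    to_ghz = lambda w: w / (2.0 * math.pi * 1e9)
    print(f"z1: exact {to_ghz(z1):.1f} GHz, approx {to_ghz(z1a):.1f} GHz")
    print(f"z2: exact {to_ghz(z2):.1f} GHz, approx {to_ghz(z2a):.1f} GHz")
    print(f"p1 {to_ghz(p1):.1f} GHz, p2 {to_ghz(p2):.1f} GHz, "
          f"p3 {to_ghz(p3):.1f} GHz")
```

Note that the product $z_1 z_2$ is the same for the exact and approximate forms, so any error in one approximate zero is mirrored in the other; the cancellation of $z_1$ against $p_3$ is therefore only approximate at this ratio.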

Because the spectral content of the incoming 50Gbps data is expected to contain FEXT, the CTLE should be able to boost frequencies beyond 25GHz. With the expected worst-case CIJ present, the shortest unit interval of the data is about 13ps in the absence of channel loss and the longest is about 27ps; therefore, the CTLE should give sufficient high-frequency boosting between 19.2GHz and 38.4GHz. Because the focus of this research is on crosstalk cancellation rather than equalization, the worst-case channel used to demonstrate the adaptive crosstalk cancellation scheme will only have a loss of 7dB; therefore, the CTLE is designed for up to 10dB peaking at 25GHz for adequate margin.

## **Post-Layout Simulation Results**

The final layout of the chip is shown in Figure 60. The incoming data comes in through pads 4, 5, 7, and 8 directly to the CTLE. The filtering capacitors required for adaptation in both the equalization and crosstalk cancellation loops are all realized on chip, while the large capacitors required for the CDR loop filters are routed off chip. The chip has three test ports: pads 11 and 13 are for analyzing the 3.125GHz sub-rate clock, pads 55 and 57 output the de-muxed 25Gbps half-rate data from CDR\_0, and pads 61 and 63 drive the full-rate 50Gbps data right after the crosstalk cancellation takes place.

Figure 60. Final layout of the chip

To simulate this chip, different pattern combinations and different channels, realized as parallel microstrip lines of various lengths, were used to ensure both the adaptive equalizer and adaptive crosstalk cancellation circuit functioned properly. As shown in Figure 61, the channels were simulated using both HFSS and ADS electromagnetic simulation tools. The substrate material chosen was Rogers 4350B for a cost-effective solution suitable for high-speed operation. The substrate thickness is 4mils, the conductor thickness is 0.685mils, and both the conductor width and microstrip spacing are 6mils. Although Figure 61(a) shows only a 1-inch channel with the aforementioned parameters, the worst-case channel simulated is a 3-inch-long channel with the S-parameters shown in Figure 62. Additionally, other lengths and combinations of widths and spacings were tested as well to ensure the crosstalk cancellation circuit converged under as many scenarios as possible.

Figure 61. (a) Parallel microstrip lines in HFSS; (b) cross-sectional view as seen in ADS



Figure 62. S-parameters for expected worst-case channel

The control voltages for the adaptive loops are shown in Figure 63. The adaptive equalization converges much faster than the adaptive crosstalk cancellation. Unfortunately, the adaptation time for the crosstalk cancellation loop is quite difficult to determine accurately for this particular architecture because of the binary phase detector. As shown previously in Figure 54, the phase detector gain varies greatly depending on how much jitter is present on the incoming data to the CDR. The filtering capacitor for the adaptive crosstalk cancellation loop was increased until convergence was observed. While the convergence time for the crosstalk cancellation loop is only required to be at least two to three times longer than that of the adaptive equalization loop as shown above, in the final design the crosstalk cancellation loop convergence time is designed to be at least ten times the minimum requirement to lower the disturbances that may appear on the control voltage.

Figure 63. Control voltages for both adaptive equalizer and crosstalk cancellation loops

Figures 64 and 65 show the eye diagrams for critical points in the receiver data path after the CDR and all adaptation loops have locked. As seen in Figure 64, even after the CTLE compensates for channel loss, there is still a significant amount of deterministic jitter remaining. Figure 65(a) shows (in green) the data after passing through the crosstalk cancellation circuit, and the 25GHz clock signal locked to the center of the eye. Because probing the XTC output on-chip is not possible, it is necessary to understand how much degradation of the data eye occurs between the XTC output and the eye diagram that will be viewed during physical testing. Therefore, the pad, bond wire inductance, and microstrip channel are all accounted for in simulation. Figure 65(b) shows the eye diagram expected at the oscilloscope. Table 8 summarizes the results and compares the eye openings with optimal manual adjustment and with the adaptation scheme. The adaptation comes very close to the optimal values obtained through manual adjustment. This comparison was also made with different combinations of data patterns and channel characteristics. Figure 66 shows another group of waveforms for a case where the channel is only 2" long and has less channel loss and crosstalk. Again, both adaptation loops converge to their optimal values.



Figure 64. CTLE output for worst-case channel







Figure 65. (a) XTC and clock, and (b) 50Gbps test port outputs for worst-case channel

| Signal        | Optimal Eye Opening (ps) | Adaptation Eye Opening (ps) |
|---------------|--------------------------|-----------------------------|
| RX Input      | 7.9                      | 7.9                         |
| CTLE Output   | 11.7                     | 11.6                        |
| XTC Output    | 15.6                     | 15.3                        |
| Output Driver | 15.3                     | 15.2                        |

Table 8. Summary of eye opening for worst-case channel



Figure 66. Outputs for a 2" channel

# Chapter 6 Test Results

The test chip was fabricated using TowerJazz's SBC18H3 BiCMOS process. The die was bonded to three different boards, each with a different input channel characteristic.



Figure 67. Die photo before wirebonding

Figure 67 shows a photo of the die sitting in the cavity in the board before the wirebonding process. Although the overall chip functioned correctly, differences between the models given in the simulation design kit and the actual fabricated die resulted in nonidealities that impacted performance. In this chapter, these effects will be covered together with the final frequency- and time-domain measurements.

Although the VCO was designed with 8 bands in mind to achieve a tuning range greater than 2GHz, changing between the lowest and highest bands shows the frequency shifts by less than 500MHz. Within a single band, the observed VCO tuning range is also only about 180MHz. With a design target of a 25GHz VCO, the VCO tested in the lab ranges from 24.25GHz to 24.72GHz. All 8 bands appear to have shifted to the low-end of the frequency range, and the varactor tuning range also seems to be about half the expected value.

Since the VCO capacitor bank is made up of reverse-biased PN junctions, the lack of range between bands indicates that the change in capacitance across the base-emitter junction is much less than what is shown in simulation. Likewise, the lack of tuning range within a single band indicates the MOS varactor impedance around 25GHz is not well modeled either. Despite these issues, the CDR is still able to lock between 48.6Gbps and 49.4Gbps, with the maximum shared rate between parts at 49.38Gbps, which will be the basis for all transient data presented later in this chapter.
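The measured VCO range and the CDR lock range can be cross-checked with simple arithmetic, since in a half-rate architecture the data rate is twice the VCO frequency. The sketch below uses only the measured numbers quoted above; the function names are illustrative.

```python
def data_rate_range_gbps(f_vco_min_ghz: float, f_vco_max_ghz: float):
    """Data-rate range (Gb/s) reachable by a half-rate CDR for a VCO range."""
    return 2.0 * f_vco_min_ghz, 2.0 * f_vco_max_ghz


def fractional_tuning_range(f_min: float, f_max: float) -> float:
    """Total tuning range as a fraction of the center frequency."""
    return (f_max - f_min) / ((f_max + f_min) / 2.0)


if __name__ == "__main__":
    f_lo, f_hi = 24.25, 24.72          # measured VCO range, GHz
    lock_lo, lock_hi = 48.6, 49.4      # measured CDR lock range, Gb/s
    rate_lo, rate_hi = data_rate_range_gbps(f_lo, f_hi)
    print(f"reachable data rates: {rate_lo:.2f} to {rate_hi:.2f} Gb/s")
    print(f"lock range inside reachable range: "
          f"{rate_lo <= lock_lo and lock_hi <= rate_hi}")
    print(f"measured tuning range: "
          f"{100.0 * fractional_tuning_range(f_lo, f_hi):.2f} %")
```

The reported lock range sits just inside twice the measured VCO range, and the measured total tuning range is far below the simulated value.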

At the initial stage of testing, the recovered clock was characterized using a Keysight E5052B Signal Source Analyzer (SSA). The spectra shown in Figures 68 and 69 are for a repeating '1111000011110000' input data stream. Figure 68 shows the data retimed by the recovered clock with rms jitter of about 205fs, whereas the divide-by-8 clock (Figure 69) shows about 250fs rms jitter. Figure 70 shows the same divided-down clock for a 49.12Gbps PRBS7 input data stream. While some additional spurs appear at the high-frequency end of the phase noise spectrum, the overall jitter is not degraded much. However, as the PRBS order increased, the spurs became progressively worse, with the worst-case phase noise shown in Figure 71. In the interest of keeping the total noise on the chip low for the crosstalk cancellation adaptation, only periodic, PRBS7, and PRBS9 signals are used for transient analysis.

Figure 68. Input 49.12Gbps '1111000011110000' data retimed by half-rate recovered clock

Figure 69. Divide-by-8 of recovered clock with input 49.12Gbps '1111000011110000' data



Figure 70. Divide-by-8 of recovered clock with input 49.12Gbps PRBS7 data



Figure 71. Divide-by-8 of recovered clock with input 49.12Gbps PRBS31 data

As a comparison to the SSA measurements, Figure 72 shows the demuxed and retimed differential waveform for a repeating '11001100' input data pattern to show both locking and the clock-related jitter of the CDR. Unfortunately, the duty-cycle distortion (DCD) is very large in this retimed output waveform, most likely due to a mismatch in bondwire lengths between the P and N sides of the output driver. This was expected during the board design phase and thought to be acceptable since the retimed output is just a functional check and is not indicative of the full-rate eye before the CDR. Due to area limitations, the matching for this test port was sacrificed to make sure the test port for the full-rate data would not have such a large amount of DCD.



Figure 72. Jitter measurement of demuxed output of 49.38Gbps '11001100' pattern

For time-domain measurements, the chip is tested on multiple boards and measured using a Keysight 86100D Infiniium DCA-X mainframe with an 86118A 70-GHz sampling module and an 86107A Precision Timebase. The full testbench is shown in Figure 73. In order to achieve a clean eye at the output, both the equalizer and crosstalk cancellation circuit need to function correctly. The varactor model was already shown to be inaccurate in the VCO tuning frequency, and the same modeling inaccuracy causes a significant degradation in the CTLE performance. Adjusting the CTLE control voltage from minimum to maximum does not change the residual ISI at the output of the test port at all, indicating that either the capacitance tuning range is too small or the nominal value of the degeneration capacitor is far from the expected value.



Figure 73. Testbench

As the amount of residual ISI also does not change much between the three boards, it can be concluded that the CTLE peaking frequency is lower than the original design target of 25-30GHz. Unfortunately, because the CTLE output is followed by many slicers, it is difficult to determine just how far the transfer function is from the design target. This unintended peaking in the receiver response amplifies the crosstalk as well, so that the mid-length board actually shows eye closure similar to the worst case in simulation. This, combined with a larger bondwire inductance in the lab, eats into the horizontal eye margin considerably and prevents the worst-case board from functioning properly.

For the mid-length board, the results shown below use a PRBS7 signal as the victim, tested with both a repeating '10101010' aggressor and a PRBS9 aggressor. Figure 74 shows the jitter measurement for a PRBS7 signal with '10101010' crosstalk. The periodic jitter (PJ) shown here is relatively large, close to 6ps out of the 20ps UI. In the jitter histogram decomposition, the RJ/PJ histogram also has a distorted shape compared to a regular Gaussian distribution. By manually tuning the crosstalk cancellation control voltage, the PJ can be decreased considerably, as shown in Figure 75. The eye is more open, and the RJ/PJ histogram resembles a Gaussian distribution. When the adaptation loop is used, the eye also improves considerably, although it settles at a suboptimal setting, as shown in Figure 76. Across different parts, there is about a 150-300fs residual offset from the optimal setting when using the adaptation.
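To put the quoted jitter figures in proportion, the following minimal sketch converts them into fractions of a unit interval. It assumes only the 49.38Gbps data rate and the 6ps and 150-300fs values quoted in the text; the function names are illustrative and not part of the test setup.

```python
def unit_interval_ps(bit_rate_gbps: float) -> float:
    """Unit interval in picoseconds for a bit rate given in Gb/s."""
    return 1000.0 / bit_rate_gbps


def jitter_in_ui(jitter_ps: float, bit_rate_gbps: float) -> float:
    """Express a jitter value (in ps) as a fraction of the unit interval."""
    return jitter_ps / unit_interval_ps(bit_rate_gbps)


BIT_RATE_GBPS = 49.38  # per-lane data rate used in the time-domain tests

ui_ps = unit_interval_ps(BIT_RATE_GBPS)
pj_before_ui = jitter_in_ui(6.0, BIT_RATE_GBPS)    # PJ before cancellation
offset_low_ui = jitter_in_ui(0.150, BIT_RATE_GBPS)  # residual offset, low end
offset_high_ui = jitter_in_ui(0.300, BIT_RATE_GBPS)  # residual offset, high end

print(f"UI = {ui_ps:.2f} ps")
print(f"PJ before XTC = {pj_before_ui:.3f} UI")
print(f"Adaptive residual offset = {offset_low_ui:.4f} to {offset_high_ui:.4f} UI")
```

Under these numbers the uncancelled PJ occupies roughly 30% of the UI, while the residual offset of the adaptive setting is on the order of 1% of the UI.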



Figure 74. 49.38Gbps PRBS7 signal with '10101010' crosstalk



*Figure 75. 49.38Gbps PRBS7 signal with '10101010' crosstalk after XTC (optimal setting)* 



*Figure 76. 49.38Gbps PRBS7 signal with '10101010' crosstalk after XTC (adaptive setting)*

For a more practical scenario, PRBS9 traffic is also used in place of the periodic signal as the aggressor. Figure 77 shows a PRBS7 eye with PRBS9 crosstalk. In this case, because the coupled PRBS aggressor signal manifests itself as broadband jitter and noise, the scope is unable to distinguish the crosstalk from the random jitter spectrum. Therefore, the eye is first measured in the absence of crosstalk, and the RJ is then fixed at that value. The previous dual-Dirac modeling for PJ is also replaced with a Bounded-Uncorrelated Jitter (BUJ) specification to more accurately characterize eyes with large amounts of broadband smearing. As shown in Figure 77, the BUJ is also around 6ps, although the accuracy of this value depends on the accuracy of the fixed RJ value.

Figure 77. 49.38Gbps PRBS7 signal with PRBS9 crosstalk

Still, the eye is noticeably improved in Figure 78 with manual control. The BUJ is decreased from 6ps down to 720fs, and the eye opening is also improved. Using the adaptive loop, the residual BUJ is very close to that of the optimal setting, as shown in Figure 79.
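To illustrate why the bounded term matters once RJ is held fixed, the sketch below applies the generic textbook dual-Dirac extrapolation, TJ(BER) = DJ + 2·Q(BER)·RJ, and treats the BUJ as the bounded term. This is an assumption made for illustration only: it is not the scope's internal decomposition algorithm, and the RJ value and target BER are hypothetical placeholders rather than measured values from this work. Only the 6ps and 720fs BUJ figures come from the text.

```python
from statistics import NormalDist


def q_factor(ber: float) -> float:
    """Gaussian Q such that the one-sided tail probability equals `ber`."""
    return -NormalDist().inv_cdf(ber)


def total_jitter_ps(bounded_ps: float, rj_rms_ps: float, ber: float) -> float:
    """Dual-Dirac style estimate: bounded term plus 2*Q*RJ at the given BER."""
    return bounded_ps + 2.0 * q_factor(ber) * rj_rms_ps


TARGET_BER = 1e-12          # hypothetical target BER (assumption)
RJ_RMS_PS = 0.25            # hypothetical fixed RJ, NOT a measured value
BUJ_BEFORE_PS = 6.0         # BUJ with PRBS9 crosstalk, from the text
BUJ_AFTER_PS = 0.72         # BUJ after cancellation, from the text

tj_before = total_jitter_ps(BUJ_BEFORE_PS, RJ_RMS_PS, TARGET_BER)
tj_after = total_jitter_ps(BUJ_AFTER_PS, RJ_RMS_PS, TARGET_BER)

print(f"Q({TARGET_BER:g}) = {q_factor(TARGET_BER):.3f}")
print(f"TJ before = {tj_before:.2f} ps, TJ after = {tj_after:.2f} ps")
```

Because RJ is fixed in this model, the whole change in the estimated total jitter comes from the bounded term, which is why an error in the fixed RJ value shifts the reported BUJ directly.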

In this section, test results have been presented showing that the adaptive crosstalk cancellation loop functions to some extent. More conservative design choices in the CTLE, as well as better planning of the board and chip top-level layout, could likely alleviate the large 5-6ps ISI seen on all parts regardless of channel or crosstalk. However, even with these challenges, the adaptive loop still converges for the test signals used.



*Figure 78. 49.38Gbps PRBS7 signal with PRBS9 crosstalk (optimal setting)* 



*Figure 79. 49.38Gbps PRBS7 signal with PRBS9 crosstalk (adaptive setting)* 

## **Chapter 7 Conclusion**

A blind adaptive far-end crosstalk cancellation method is proposed in this dissertation. Along with an analysis of the effect of far-end crosstalk on broadband data, the different approaches present in industry are also compared. The adaptation scheme is presented starting with the method proposed in [16], where the motivation was to find a blind adaptive solution. By simplifying the logic in [15] to a single XOR gate, the hardware overhead was shown to be minimal. Additional analysis of sub-rate clocking schemes then improved upon the original architecture by relaxing both the bandwidth and timing requirements on the data path. The adaptation concepts presented in this dissertation can be scaled to any sub-rate CDR architecture with some modification of the loop parameters.

Despite the challenges associated with modeling inaccuracies, the adaptive crosstalk cancellation scheme is still shown to be functional. The residual crosstalk from a PRBS9 signal could be reduced from 6ps down to 720fs for a 49.38 Gbps data eye. Regardless of the spectral content of the aggressor signal, the adaptation is shown to be close to the optimal settings as long as the CDR is attempting to phase-lock. The die draws 187mA from a 1.8V supply when running at a rate of 2 x 49.38Gbps, translating to a FoM of 3.4pJ/bit.
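The quoted figure of merit can be checked with a short calculation. The sketch below assumes the FoM is defined as total supply power divided by aggregate data rate across both lanes, which is consistent with the numbers given; the variable names are illustrative.

```python
SUPPLY_CURRENT_A = 0.187   # 187 mA total die current
SUPPLY_VOLTAGE_V = 1.8     # supply voltage
LANES = 2                  # 2 x 49.38 Gbps
BIT_RATE_BPS = 49.38e9     # per-lane data rate

# Total power drawn by the die.
power_w = SUPPLY_CURRENT_A * SUPPLY_VOLTAGE_V

# Aggregate throughput over both lanes.
aggregate_rate_bps = LANES * BIT_RATE_BPS

# Energy per bit, converted from joules to picojoules.
fom_pj_per_bit = power_w / aggregate_rate_bps * 1e12

print(f"Power = {power_w * 1e3:.1f} mW")
print(f"FoM = {fom_pj_per_bit:.2f} pJ/bit")
```

With these inputs the power is about 337mW and the energy efficiency rounds to the 3.4pJ/bit stated above.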

#### References

[1] B. De Muer and M.S.J. Steyaert, "A CMOS Monolithic ΔΣ-Controlled Fractional-N Frequency Synthesizer for DCS-1800," *IEEE J. Solid-State Circuits*, Vol. 37, No. 7, July 2002, pp. 835-844.

[2] R.C.H. van de Beek, et al., "A 2.5-10-GHz Clock Multiplier Unit with 0.22-ps RMS Jitter in Standard 0.18-µm CMOS," *IEEE J. Solid-State Circuits*, Vol. 39, No. 11, Nov. 2004, pp. 1862-1872.

[3] G. Zhang and M.M. Green, "A 10 Gb/s BiCMOS adaptive cable equalizer," *IEEE Journal of Solid-State Circuits*, Nov. 2005, pp. 2132-2140.

[4] J. Lee, "A 20-Gb/s Adaptive Equalizer in 0.13-µm CMOS Technology", *IEEE Journal of Solid-State Circuits*, Sept. 2006, pp. 2058-2066.

[5] A. Momtaz and M.M. Green, "An 80 mW 40 Gb/s 7-Tap T/2-Spaced Feed-Forward Equalizer in 65 nm CMOS," *IEEE J. Solid-State Circuits*, vol. 45, March 2010, pp. 629-639.

[6] M. Kargar and M.M. Green, "A 10Gb/s adaptive analog decision feedback equalizer for multimode fiber dispersion compensation in 0.13 μm CMOS," in 2010 Proceedings of the ESSCIRC, Sept. 2010, pp. 550-553.

[7] K.-J. Sham, et al., "FEXT Crosstalk Cancellation for High-Speed Serial Link Design," Proc. CICC, Sep. 2006, pp. 405-408.

[8] S.Y. Kao and S.-I. Liu, "A 7.5-Gb/s one-tap FFE transmitter with adaptive far-end crosstalk cancellation using duty cycle detection," *IEEE J. Solid-State Circuits*, vol. 48, no. 2, Feb. 2013, pp. 391–404.

[9] K.I. Oh, et al., "A 5-Gb/s/pin Transceiver for DDR Memory Interface with a Crosstalk Suppression Scheme," *IEEE J. Solid-State Circuits*, Vol. 44, No. 8, Aug. 2009, pp. 2222-2232.

[10] T. Oh and R. Harjani, "A 6-Gb/s MIMO Crosstalk Cancellation Scheme for High-Speed I/Os," *IEEE Journal of Solid-State Circuits*, vol. 46, Aug. 2011, pp. 1843-1856.

[11] Taehyoun Oh and R. Harjani, "A 5-Gb/s 2x2 MIMO Crosstalk Cancellation Scheme for High-Speed I/Os," in 2010 IEEE Custom Integrated Circuits Conference (CICC). doi: 10.1109/CICC.2010.5617599

[12] Y. Liu and S. Liu, "4-Gb/s Parallel Receivers With Adaptive Far-End Crosstalk Cancellation," *IEEE Transactions on Circuits and Systems II: Express Briefs*, May 2013, pp. 252-256.

[13] S.-K. Lee, H. Ha, H.-J. Park, and J.-Y. Sim, "A 5Gb/s single-ended parallel receiver with adaptive FEXT cancellation," in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2012, pp. 140–141.

[14] K. Hwang and L. Kim, "A 5 Gbps 1.6 mW/Gbps/CH Adaptive Crosstalk Cancellation Scheme With Reference-less Digital Calibration and Switched Termination Resistors for Single-Ended Parallel Interface," *IEEE Transactions on Circuits and Systems I: Regular Papers*, Vol. 61, No. 10, Oct. 2014, pp. 3016-3024.

[15] J.F. Buckwalter and A. Hajimiri, "Cancellation of crosstalk-induced jitter," *IEEE Journal of Solid-State Circuits*, vol. 41, Mar. 2006, pp. 621-632.

[16] J. Han and M. Green, "A 2 × 50-Gb/s receiver with adaptive channel loss equalization and far-end crosstalk cancellation," *2015 IEEE International Symposium on Circuits and Systems (ISCAS)*, May 2015, pp. 2387-2400.

[17] J. Savoj and B. Razavi, "A 10-Gb/s CMOS clock and data recovery circuit with a half-rate linear phase detector", *IEEE Journal of Solid-State Circuits*, Vol. 36, Issue 5, May 2001, pp. 761-768.

[18] S. L. J. Gierkink, S. Levantino, R. C. Frye, C. Samori, and V. Boccuzzi, "A low-phase-noise 5-GHz CMOS quadrature VCO using superharmonic coupling," *IEEE J. Solid-State Circuits*, vol. 38, no. 7, July 2003, pp. 1148–1154.

[19] J. Lee and B. Razavi, "A 40-Gb/s Clock and Data Recovery Circuit in 0.18-μm CMOS Technology", *IEEE J. Solid-State Circuits*, vol. 38, no. 12, Dec. 2003, pp. 2181-2190.

[20] L. Li and M. Green, "Power optimization of an 11.75-Gb/s combined decision feedback equalizer and clock data recovery circuit in 0.18-μm CMOS," *IEEE Trans. Circuits Syst. I*, Reg. Papers, vol. 58, no. 3, Mar. 2011, pp. 441–450.