## Title

Recursive Switched-Capacitor Circuit Topologies for Miniaturized Power Conversion

## Permalink

https://escholarship.org/uc/item/1x1593r3

## Author

Salem, Loai

## Publication Date

2018

Peer reviewed|Thesis/dissertation

## UNIVERSITY OF CALIFORNIA SAN DIEGO

## Recursive Switched-Capacitor Circuit Topologies for Miniaturized Power Conversion

A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Electrical Engineering (Electronic Circuits and Systems)

by

Loai G. Salem

Committee in charge:

Professor Patrick P. Mercier, Chair

Professor Peter M. Asbeck
Professor James F. Buckwalter
Professor Gert Cauwenberghs
Professor Tajana Rosing

## Copyright

Loai G. Salem, 2018
All rights reserved.

The dissertation of Loai G. Salem is approved, and it is acceptable in quality and form for publication on microfilm and electronically:
Chair

University of California San Diego

2018

DEDICATION

To my parents

## TABLE OF CONTENTS

Signature Page ..... iii
Dedication ..... iv
Table of Contents ..... v
List of Figures ..... ix
List of Tables ..... xvi
Acknowledgements ..... xvii
Vita ..... xix
Abstract of the Dissertation ..... xxii
Chapter 1 Introduction ..... 1
1.1 What is a Power Converter? ..... 1
1.2 Switched Inductor versus Switched Capacitor Power Conversion ..... 2
1.2.1 Challenges of Switched-Capacitor Power Conversion ..... 3
1.2.2 Advantages of Switched-Capacitor Power Conversion ..... 3
1.3 Developments in This Work ..... 4
I Miniaturizing DC-to-DC Power Conversion ..... 8
Chapter 2 Recursive Switched-Capacitor DC-to-DC Converters ..... 9
2.1 Introduction ..... 9
2.2 Recursive Switched-Capacitor Topology ..... 12
2.2.1 Topology Definition and Steady-State Loss Analysis ..... 12
2.2.2 Open-Loop Power Stage Optimization ..... 17
2.3 Recursive Resolution-Reconfiguration Architecture ..... 20
2.3.1 Recursive Inter-Cell Connection ..... 20
2.3.2 Recursive Cell Slicing ..... 21
2.3.3 Inter-Cell Reconfiguration Switches ..... 23
2.4 Circuit Implementation ..... 25
2.4.1 4-Bit Power Stage Block Diagram ..... 25
2.4.2 Reconfiguration Costs ..... 26
2.4.3 Programmable-Port SC Boundary and Transfer Cells ..... 27
2.4.4 Output Voltage Regulation ..... 28
2.5 Experimental Verification ..... 30
2.6 Conclusion ..... 43
2.7 Acknowledgements ..... 43
Chapter 3 Flying Domain DC-to-DC Conversion ..... 45
3.1 Introduction ..... 45
3.2 Achieving High Power Density and Efficiency via the Flying-Domain Technique ..... 48
3.2.1 Supply Ripple Allowance and Overall Circuit Efficiency ..... 48
3.2.2 Switching a Capacitor versus Switching the Load ..... 53
3.3 State-Space Modeling of Flying-Domain DC-DC Converters ..... 56
3.3.1 Modeling a 2-to-1 Flying-Domain Converter ..... 56
3.3.2 Modeling a 4-to-1 Flying-Domain Converter ..... 59
3.4 Circuit Implementation ..... 62
3.4.1 Reconfigurable Power Stage Design ..... 62
3.4.2 Switch-Load Voltage Matching for Optimal Conductance Tracking ..... 64
3.4.3 Hysteretic Control ..... 65
3.4.4 Flying-Domain Interface Shifters ..... 66
3.5 Experimental Verification ..... 68
3.6 Conclusion ..... 74
3.7 Acknowledgements ..... 74
Chapter 4 A Switched-Capacitor Power Management Integrated Circuit ..... 76
4.1 Introduction ..... 76
4.2 Frequency-Scaled Gear-Train SC Topology ..... 77
4.3 Circuit Implementation ..... 78
4.4 Measurement Results ..... 79
4.5 Acknowledgements ..... 81
II Miniaturizing DC-to-AC Power Conversion ..... 86
Chapter 5 A Recursive House-of-Cards Power Amplifier ..... 87
5.1 Introduction ..... 87
5.2 House-of-Cards Switched-Capacitor Power Amplifier ..... 89
5.2.1 Implicit DC-DC Conversion via Stacked-Amplifier Charge-Recycling ..... 89
5.2.2 High-Voltage RF Signal Generation in Scaled CMOS without Magnetics ..... 94
5.3 Recursive HoC Amplifier Architecture ..... 99
5.3.1 HoC Digital Power Amplifier Linearization ..... 99
5.3.2 Voltage-Mode Magnetic-less Swapping Doherty for High Average Efficiency ..... 102
5.3.3 Stacked-FET AM-AM and AM-PM Distortion ..... 106
5.3.4 Recursive HoC Slice Architecture ..... 109
5.4 Circuit Implementation ..... 111
5.4.1 Reconfigurable Class-D PA Cell Design ..... 111
5.4.2 Interfacing Level Shifters ..... 115
5.5 Experimental Results ..... 117
5.6 Conclusion ..... 124
5.7 Acknowledgements ..... 124
Chapter 6 Adiabatic Clocking ..... 125
6.1 Introduction ..... 125
6.2 Challenges of Resonant Clocking ..... 126
6.3 Adiabatic Switched-Capacitor Driver ..... 126
6.4 Circuit Implementation ..... 127
6.5 Measurement Results ..... 130
6.6 Acknowledgements ..... 131
III Fine-Grain Power Management ..... 135
Chapter 7 A Recursive Digital Low-Dropout Voltage Regulator ..... 136
7.1 Introduction ..... 136
7.2 Successive-Approximation Digital LDO Topology and Operation ..... 138
7.2.1 SAR LDO Architecture ..... 138
7.2.2 Performance Comparison: Speed-Power Trade-off Improvement via SAR Control ..... 140
7.3 Variable Coefficients Proportional-Derivative Compensation ..... 142
7.3.1 Stability Analysis of DLDOs using a Bode Diagram Approach ..... 142
7.3.2 Adaptive Zero Insertion through Variable-Coefficients PD Compensation ..... 145
7.4 Sub-LSB Hysteretic PWM Control ..... 148
7.4.1 Minimum Current Limit of Linear-Search Based DLDOs ..... 148
7.4.2 Minimum Current Limit in a Binary-Search DLDO ..... 149
7.4.3 Hysteretic PWM Control ..... 150
7.5 Circuit Implementation ..... 151
7.5.1 Proportional-Derivative Compensator Implementation ..... 152
7.5.2 Regulation-Loop Interruption Logic ..... 154
7.5.3 Successive Approximation Controller ..... 158
7.6 Experimental Verification ..... 159
7.6.1 Transient Measurements ..... 159
7.6.2 Steady-State Measurements ..... 162
7.6.3 Performance Summary ..... 165
7.7 Conclusion ..... 165
7.8 Acknowledgements ..... 166
Chapter 8 A Digital Low-Dropout Voltage Regulator Employing Switched-Capacitor Resistance ..... 167
8.1 Introduction ..... 167
8.2 Switched-Capacitor Low-Dropout Voltage Regulator ..... 169
8.3 Circuit Implementation ..... 170
8.4 Measurement Results ..... 172
8.5 Acknowledgements ..... 173
Bibliography ..... 178

## LIST OF FIGURES

Figure 2.1: The recursive switched-capacitor realization of the ratios $1/4$, $3/8$, and $5/16$, and topology pseudo-code. Each SC cell comprises two out-of-phase 2:1 SC for a well-posed SC network. ..... 10
Figure 2.2: Charge flow through two inter-cell connections to realize the same ratio 11/16: (a) non-optimal cascading; (b) proposed RSC connection. Bold blocks are loaded with more charge than the corresponding blocks in (b) with the RSC connection. Bold arrows represent the extra loading charge. ..... 13
Figure 2.3: The SSL power-available metric, $M_{\text{SSL}}$, for the seven topologies at binary ratios up to 5-bit resolution. The topology with the highest power-available at a certain ratio incurs the lowest charge-sharing loss for a given silicon area. ..... 16
Figure 2.4: The $R_{SSL}^{*}$ for the SP and symmetric RSC versus the binary ratios using a 1 F total capacitance and for an SC converter operated at 1 Hz.
Figure 2.5: The FSL performance metric $M_{FSL}$ of the seven topologies at binary conversion ratios up to 5-bit resolution. ..... 18
Figure 2.6: Resolution reduction from 4-bit to 1-bit and 2-bit, using output selection multiplexer (left) and recursive inter-cell connection (right). The dashed cells are disabled when realizing lower resolutions.
Figure 2.7: Resolution reduction from 4-bit to 3-bit and from 3-bit to 2-bit, using output selection multiplexer (left) and recursive slicing with recursive inter-cell connection (right).
Figure 2.8: Interconnection of two 2:1 SC cells through ratio-reconfiguration switches. $V_{\text{int}}$ is the inter-cell intermediate node.
Figure 2.9: Realization of 2-bit resolution from 3-bit resolution RSC using the same ratio-reconfiguration switches. ..... 34
Figure 2.10: Recursive implementation block diagram of the 4-bit RSC converter. The implemented RSC comprises four stages of eight cells $C_{i}$ and five reconfiguration switch blocks $R_{i,i+1}$. ..... 35
Figure 2.11: Boundary and transfer cells schematic.
Figure 2.12: Boundary and transfer cells decoder truth table. ..... 35
Figure 2.13: Recursive switched-capacitor voltage regulator implementation, comprising eight cells of binary weights and two control loops. ..... 36
Figure 2.14: Recursive binary search controller block diagram. ..... 37
Figure 2.15: Measured and model-predicted efficiency, at 2 mA fixed load current, of the fabricated 4-bit RSC versus the output voltage at an input voltage of 2.5 V. The measured three-ratio efficiency is at 1.86 mA current and the same input voltage. ..... 37
Figure 2.16: Measured and model-predicted efficiency with external resistive load, modeling a digital load under DVS operation, of the three-ratio SP and the 4-bit RSC across the output voltage, at an input voltage of 2.5 V.


Figure 2.17: Measured RSC and three-ratio SC switching frequency $f_{sw}$ across $V_{\text{out}}$, using the same external resistive load as in Fig. 2.16.
Figure 2.1 Measured RSC efficiency versus the load current at $1/2$ ratio, while supplying
Figure 2 Measured and predicted weighted-average-efficiency versus the load current density, from 0.215 to 215 mA/$\mathrm{mm}^{2}$, for the fabricated RSC and SP in 0.25 µm bulk CMOS.
Figure Coarse-controller measured transient response to a stair control voltage $V_{ref}$. Controller transient response when strobe is activated while $V_{ref}=2$ V, showing the detailed ratio binary search operation. ..... 41
Figure 3.1: Integrated DC-DC conversion methods.
Figure 3.2: Illustration showing excess power consumption from a digital load powered by a rippled-supply.
Figure 3.3: Equivalent model of a synchronous digital circuit. The origin of symbiotic capacitance is shown in the inset.
Figure 3.4: A sample 2-to-1 SC DC-DC converter. (a) Circuit topology. (b) Phase 1 and phase 2 of the voltage divider. (c) Voltage across $C_{fly}$ and the output voltage waveforms.
Figure 3.5: SSL-FSL corner frequencies for SC and FD circuits.
Figure 3.6: A 2-to-1 FD DC-DC converter. (a) Circuit topology. (b) Phase 1 and phase 2 of the voltage divider, twigs shown with dark lines.
Figure A 4-to-1 FD DC-DC converter. (a) Circuit topology. (b) The resulting four switching phases of a properly-posed 4:1 FD converter, twigs shown with dark lines. $V_{IN}$ is a twig in the first three phases only: $\Phi_{1}$, $\Phi_{2}$, and $\Phi_{3}$.
Figure 3.8: Schematic of the unit 2:1 FD power stage cell.
Figure 3.9: Reconfiguring the implemented FD converter between the 2:1 and 4:1 modes.
Figure 3.10: Plot of switch width (left axis, green dotted curves) and efficiency (right axis, red solid curves) with and without automatic conductance tracking.
Figure 3.11: Transient waveforms of the hysteretic controller and block diagram of the
Figure 3.12: Schematic and example timing diagram of the proposed sampling level shifters. ..... 66
Figure 3.13: Schematic of the proposed continuous-time level shifters. ..... 67
Figure 3.14: Die photograph of the test chip.
Figure 3.15: Measured efficiency of the 2:1 FD converter when powering an on-chip inverter string.
Figure 3.16: Measured input and output of the cascaded inverter chain when powered by the FD converter.
Figure 3.17: Measured frequency of the on-chip ring oscillator when powered from a conventional supply and from the FD converter (a). Measured frequency offset between the two configurations (b).
Figure 3.18: Measured energy of the ring oscillator load when powered from a conven-

Figure 3.19: Measured efficiency of the 2:1 FD converter when powering a ring oscillator load. ..... 71
Figure 3.20: Measured debugging I/O waveforms of the M0 processor after level shifters. ..... 72
Figure 3.21: Measured controller waveforms (a) and simulated voltage across flying load $V_{L,fly}$ (b) under a current step from $21.8\,\mu\mathrm{A}$ to 1 mA. ..... 73
Figure 4.1: Proposed frequency-scaled gear-train switching scheme illustrating how to

Figure 4.3: Switch-level block diagram of the implemented converter and switching states of the implemented CF. ..... 80
Figure 4.4: Implementation of boundary and transfer cells. ..... 81
Figure 4.5: Measured and modeled efficiency curves of the SC PMIC at 4.2 V, 3.6 V, and 2.8 V for a DVS-modeling load resistance of $120\,\Omega$. ..... 82
Figure 4.6: Measured and modeled efficiency curves of the SC PMIC at 4.2 V, 3.6 V, and 2.8 V for a DVS-modeling load resistance of $20\,\Omega$. ..... 83
Figure 4.7: Efficiency of SC PMIC versus load current. ..... 84
Figure 4.8: Die photo. ..... 84
Figure 4.9: Comparison with prior work. ..... 85
Figure 5.1: Conventional class-G operation during a 6 dB back-off (a). Implicit 100\% efficiency DC-DC conversion via charge-recycling PA stacking (b). ..... 90
Figure 5.2: An example 2-stack PA operation from $V_{in}=2V_{DD}$. (a) Switch-level block diagram. (b) The resulting two switched networks of the HoC PA when the PM clock is high and low. (c) Differential operation for elimination of $V_{\text{int}}$ capacitance. ..... 92
Figure 5.3: Implicit DC-DC conversion via charge-recycling. The same total DC capacitance, i.e. $C_{1}+C_{2}$ in Fig. 5.2, can be equally divided between the $N-1$ intermediate nodes for the same ripple amplitude as the 2-stack PA, in the case of a non-differential operation; similar explanation for $C_{fly}$. ..... 94
Figure 5.4: Prior schemes to realize high RF power using thin-oxide devices. (a) Parallel and series power combining schemes. (b) Digital device-stacked PAs. ..... 95
Figure 5.5: An example 2-stack HoC PA (a). Resulting phases when $\phi$ is high and low (b). ..... 97
Figure 5.6: An example 3-stack HoC PA (a). Fundamental AC PA cells and their gate/drain voltages to enable aligned safe operation through the clamping capacitors while performing series power combining (b). ..... 98
Figure 5.7: (a) Block diagram of a single-ended HoC; actual implementation is differential. (b) Top-level schematic of an HoC slice of the 16 slices; actual slice is differential. ..... 99
Figure 5.8: Equivalent circuit of the implemented Switched-Capacitor HoC PA. ..... 101

Figure 5.9: Drain efficiency of a 5-bit SC PA and a class-G-like HoC. Comparison of the drain efficiency of the proposed Doherty-like HoC (5.5) and a conventional 5-bit SC PA with two supplies $V_{DD}$ and $2V_{DD}$ (5.6), all at $Q_{l}$ of 0.5. ..... 102
Figure 5.10: Reconfiguring the HoC slice transformation ratio from 1:2 to 1:1 to achieve high-efficiency at back-off (a). Simplified equivalent circuit (b). ..... 103
Figure 5.11: Equivalent circuit of the HoC while generating amplitudes between the two transformation ratios (a). Load-pull characteristics of the HoC for $K \leq i \leq 2K$ (b). Normalized voltages and admittances of the main and peaking amplifiers (c). (d) Swapping Doherty illustration. ..... 105
Figure 5.12: Maximum distortion value due to on-resistance mismatch in a 4-bit DAC (a) with a total conductance $G_{on}$ of $0.1\,\Omega^{-1}$ and (b) with $G_{on}$ of $1\,\Omega^{-1}$. Total $C_{c}=25$ pF and $f_{o}=1$ GHz. ..... 108
Loading of a disabled PA cell resulting in potential device voltage rating violations. ..... 110
Figure 5.14: Recursive architecture of an HoC slice. The output-side stacked two DC capacitors in Fig. 5.7b are not shown for clarity.
Figure 5.15: Reconfigurable class-D PA generic cell schematic (a). Placing each PA cell in a separate deep n-well (b). Floating connection enables $2\times$ reduction in bottom-parasitics unlike the required high bias resistance used in [1]. ..... 112
Figure 5.16: Simulated overall loss optimization plots. (a) Conduction and switching loss components at $f_{o}=720$ MHz. (b) Peak-amplitude PAE and $R_{\text{out}}$ versus switch size at $f_{o}=720$ MHz. (c) Optimal peak-amplitude PAE versus $f_{o}$. ..... 114
Figure 5.17: Proposed balanced Dickson shifter (a). Generating the PM clocks of PA1, PA2, PA3, and PA4 of the recursive slice in Fig. 5.14 (b). ..... 116
Figure 5.18: Chip micrograph. ..... 118
Figure 5.19: Measurement setup. ..... 11
Figure 5.20: (a) Measured battery-to-$P_{\text{out}}$ PAE, output power ($P_{\text{out}}$), and output voltage amplitude versus the input code of the proposed flying-domain PA with $50\,\Omega$ antenna ($f_{o}=720$ MHz). (b) Measured DNL and INL of the proposed PA. ..... 119
Figure 5.21: Measured dynamic characteristics of the 16-QAM signal. ..... 120
Figure 5.22: Measured spectrum, close-in (a) and far-out (b). ..... 121
Figure 5.23: Measured time-domain output of the proposed PA with 32-QAM 20-MHz OFDM ($f_{o}=720$ MHz) (top) and measured AM step response for 6-step change (bottom). ..... 122
Figure 6.1: Prior inductive resonant clocking techniques (top); the proposed switched-capacitor multi-level adiabatic clocking technique (bottom). ..... 127
Figure 6.2: Circuit schematics and timing diagrams of the 4-level inverter, showing how it can be reconfigured into 3- and 2-level modes. ..... 129
Figure 6.3: Schematic of the custom HoC timing gate (top); architecture of the implemented test chip (bottom). ..... 130


Figure 6.4: Measured clock power improvement of 4- and 3-level clocking compared to conventional clocking at 1 V and 0.4 V across frequency (left); measured CLK energy-per-bit improvement of the 4-level inverter across supply voltages. ..... 132
Figure 6.5: Measured power savings through 4-level clocking from 1 MHz to 2 GHz and 0.4 V to 1 V, achieving an average efficiency of 41.8\% across the entire space (top); measured transient waveforms of the 4-level adiabatic clock at (10 MHz, 1 V). ..... 113
Figure 6.6: Comparison of the proposed adiabatic clocking scheme versus resonant clocking implementations.
Figure 6.7: Micrograph of the fabricated chip. ..... 134
Figure 7.1: (a) Block-level diagram of the proposed RLDO. (b) Illustrative $V_{\text{out}}$ response to a 0-to-$V_{\text{ref}}$ step when the rate of change of $V_{\text{out}}$ is much faster than the clock frequency, $f_{CLK}$, to explicitly show the binary search process.
Figure 7.2: Transient $V_{\text{out}}$ response of 7-bit linear-search and binary-search DLDOs to a 0-to-$I_{L,\max}$ load step. (a) Close-in. (b) Far-out. Both LDOs have the same total array conductance, $G$, where $G \Delta V_{\text{drop-out}}$ matches $I_{L,\max}$.
Figure 7.3: FOM of a linear-search DLDO and a binary-search RLDO normalized to the analog LDO FOM for the same process technology, $I_{Q}/\Delta I_{L}$, and $C_{\text{out}}$ versus the required resolution $N$.
Figure 7.4: RLDO model. (a) Small-signal AC model. (b) Bode diagram. (c) RLDO with PD controller. The SAR controller acts as a variable-gain discrete-time integrator.
Figure 7.5: (a) Transient $V_{\text{out}}$ simulations of a 7-bit RLDO with the proposed PD compensation, and a 7-bit DLDO at peak current $R_{L}=1\,\Omega$ and at light current $R_{L}=25\,\Omega$ ($V_{\text{in}}=2$ V, $V_{\text{ref}}=1$ V, $G=5\,\Omega^{-1}$, $C_{\text{out}}=1/(2\pi)$). (b) Simulations of the PD-compensated RLDO at $R_{L}=100\,\Omega$ and $R_{L}=1000\,\Omega$.
Figure 7.6: DLDO transient $V_{\text{out}}$ response with and without the proposed PD compensator. ..... 11
Figure 7.7: Setting-limits of digital LDOs minimum current. (a) Simulated steady-state error versus $R_{L}$ at $f_{CLK}=10 \times f_{L}$ and (b) simulated steady-state $V_{\text{out}}$ ripple $\Delta V_{\text{out},p-p}$ versus $f_{CLK}/f_{L}$ of a 7-bit shifter-based DLDO with $V_{\text{ref}}=V_{\text{in}}/2$.
Figure 7.8: Hysteretic dual-bound controller. (a) Top-level schematic. (b) Operation.
Figure 7.9: Top-level state diagram of the proposed RLDO.
Figure 7.10: Quantized gate-level implementation. (a) Equivalence of a clocked comparator to a quantized AND gate. (b) Quantized gate-level implementation of the PD compensator truth table in Table 7.2.
Figure 7.11: Top-level schematic of the implemented PD compensator, including PWM comparators, derivative-term comparators, and bottom-plate sampling circuitry. Insets illustrate the 1st edge pass and DC correction logic. ..... 15
Figure 7.12: Difference accumulation to overpower kT/C noise due to a small sampling capacitor size. ..... 154
Figure 7.13: Proposed branch-prediction (a) flowchart and (b) state-diagram implementation. Inset: the SAR reset $CNV$ selection based on EoC through an output multiplexer. ..... 156

Figure 7.14: State-diagram implementation of the upper bound. ..... 156
Figure 7.15: SAR backbone pseudo code. ..... 158
Figure 7.16: The RLDO simulated response to a periodic load swing between $40\,\mu\mathrm{A}$ and $200\,\mu\mathrm{A}$ within 200 ps, $V_{\text{in}}=0.5$ V, $V_{\text{ref}}=0.45$ V, $C_{\text{out}}=0.4$ nF, and $f_{CLK}=100$ MHz. ..... 159
Figure 7.17: Die photo of fabricated RLDO. ..... 160
Figure 7.18: Measured transient response of the RLDO to periodic square-wave load current variation with $V_{IN}=1$ V, $V_{OUT}=0.45$ V, and $C_{\text{out}}=1\,\mu\mathrm{F}$. ..... 161
Figure 7.19: Measured transient response of the RLDO to a periodic square-wave load current variation with $V_{IN}=0.5$ V, $V_{OUT}=0.45$ V, and $C_{\text{out}}=0.4$ nF (top). When $C_{\text{out}}=1\,\mu\mathrm{F}$, the RLDO remains stable during periodic positive and negative load steps (bottom). ..... 163
Figure 7.20: Measurement of output voltage ripple during PWM duty control. ..... 163
Figure 7.21: (a) Line regulation measurement at a clock frequency of 100 MHz and a load current of 1 mA. (b) Load regulation measurement at $f_{CLK}=10$ MHz and $V_{\text{in}}=0.5$ V. ..... 164
Figure 7.22: Measured current efficiency $\eta$. (a) At $V_{\text{in}}=0.5$ V and $V_{\text{out}}=0.3$ V ($f_{CLK}=10$ MHz), demonstrating efficiency higher than 90\% from $33.6\,\mu\mathrm{A}$ to 2 mA. (b) At $V_{\text{in}}=0.5$ V and $V_{\text{out}}=0.45$ V ($f_{CLK}=10$ MHz). ..... 165
Figure 7.23: Comparison of the RLDO with prior-art DLDOs. ..... 166
Figure 8.1: A conventional switch-array DLDO (left) and its accuracy problem (bottom left and right); proposed SCR-DLDO using a switched-capacitor resistance and its frequency-programmable equivalent conductance (right and bottom-right). ..... 168
Figure 8.2: Accuracy advantage of a 1 V DLDO using SCR to perform D/A conversion in the time-domain versus SA-DLDOs that achieve poor-accuracy conversion in the current-domain (top); overhead current reduction due to $f_{CLK}$ scaling (bottom left); fast response time advantage of the proposed SCR-DLDO. ..... 171
Figure 8.3: SCR ripple mitigation via the proposed binary ripple control scheme (top-left); top-level block diagram of the implemented SCR-DLDO (top-right); cell partitioning to implement binary ripple control, along with the schematics of the SCR 1x unit-cell, non-overlap circuitry, and comparator (bottom). ..... 172
Figure 8.4: SCR-DLDO output voltage and current efficiency at corners $V_{\text{in}}=0.9$ V and $V_{\text{in}}=0.5$ V, demonstrating high accuracy and efficiency. The measured SCR-DLDO output and efficiency are compared to the measured results of a recursive DLDO. ..... 174

Figure 8.5: Measured dynamic response of the SCR-DLDO to an on-chip periodic load-step demonstrating 2.48 ns response time at 36.9 ps FOM (top). Illustration of the efficacy of the proposed binary ripple control scheme in mitigating the SCR ripple at light loads and small $V_{\text{out}}$ values (bottom).
Figure 8.6: Comparison of the proposed SCR-DLDO with state-of-the-art switch-array DLDOs illustrating the smallest area, best FOM, and highest accuracy, enabling a realistic industry-compliant digital replacement to analog LDOs for 3.1 mV step DVS and adaptive voltage scaling applications. ..... 176
Figure 8.7: Micrograph of the fabricated SCR-DLDO chip. ..... 177

## LIST OF TABLES

Table 2.1: Comparison with Previously Published Fully-Integrated SC Converters ..... 42
Table 3.1: Table of comparisons to prior-art SC and resonant converters achieving high power density or efficiency ..... 75
Table 5.1: Comparison with prior work ..... 123
Table 7.1: PD Control Action ..... 146
Table 7.2: Truth table of PD compensator ..... 152

## ACKNOWLEDGEMENTS

The material in this dissertation is based on the following published papers.
Chapter 2 is based on and mostly a reprint of the following publications:

- L.G. Salem and P.P. Mercier, "A recursive switched-capacitor DC-DC converter achieving $2^{N}-1$ ratios with high efficiency over a wide output voltage range," in IEEE Journal of Solid-State Circuits (JSSC), Dec. 2014, vol. 49, no. 12, pp. 2773-2787.
- L.G. Salem and P.P. Mercier, "An 85\%-efficiency fully-integrated 15-ratio recursive switched-capacitor DC-DC converter with 0.1-2.2V output voltage range," 2014 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, Feb. 2014, pp. 88-89.

Chapter 3 is based on and mostly a reprint of the following publications:

- L.G. Salem, J.G. Louie, and P.P. Mercier, "Flying-domain DC-DC power conversion," IEEE Journal of Solid-State Circuits (JSSC), Dec. 2016, vol. 51, no. 12, pp. 2830-2842.
- L.G. Salem, J.G. Louie, and P.P. Mercier, "A flying-domain DC-DC converter powering a Cortex-M0 processor with $90.8 \%$ efficiency," 2016 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, Feb. 2016, pp. 234-236.

Chapter 4 is based on and mostly a reprint of the following publication:
L.G. Salem and P.P. Mercier, "A battery-connected 24-ratio switched capacitor PMIC achieving 95.5\%-efficiency," 2015 IEEE Symposium on VLSI Circuits, Jun. 2015, pp. C340-C341.

Chapter 5 is based on and mostly a reprint of the following publications:

- L.G. Salem, J.F. Buckwalter, and P.P. Mercier, "A recursive switched-capacitor house-of-cards power amplifier," IEEE Journal of Solid-State Circuits (JSSC), Jul. 2017, vol. 52, no. 7, pp. 1719-1738.
- L.G. Salem, J.F. Buckwalter, and P.P. Mercier, "A recursive house-of-cards digital power amplifier employing a $\lambda / 4$-less Doherty power combiner in 65 nm CMOS," in Proc. IEEE European Solid-State Circuits Conference (ESSCIRC), Sep. 2016, pp. 189-192.

Chapter 6 is based on and mostly a reprint of the following publication:
L.G. Salem and P.P. Mercier, "A 0.4-1V 1MHz-to-2GHz switched-capacitor adiabatic clock driver achieving $55.6 \%$ clock power reduction," 2017 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, Feb. 2017, pp. 442-443.

Chapter 7 is based on and mostly a reprint of the following publications:

- L.G. Salem, J. Warchall, and P.P. Mercier, "A successive-approximation low drop-out voltage regulator," IEEE Journal of Solid-State Circuits (JSSC), Jan. 2018, vol. 53, no. 1, pp. 35-49.
- L.G. Salem, J. Warchall, and P.P. Mercier, "A 100nA-2mA Successive-Approximation digital LDO with PD compensation and sub-LSB duty control achieving a 15.1 ns response-time at 0.5 V," 2017 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, Feb. 2017, pp. 340-341.

Chapter 8 is based on and mostly a reprint of the following publication:
L.G. Salem and P.P. Mercier, "A sub-1.55mV accuracy 36.9ps FOM digital low-dropout regulator employing switched-capacitor resistance," 2018 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, Feb. 2018.

The dissertation author is the primary author of the work in these chapters, and coauthors (Prof. Patrick P. Mercier, Prof. James F. Buckwalter, John Louie, and Julian Warchall) have approved the use of the material for this dissertation.

Loai G. Salem
La Jolla, California
May, 2018

## VITA

Egypt

2009-2011 M. Sc. in Microelectronics System Design, Nile University, Egypt

2013-2018 Ph. D. in Electrical Engineering, University of California, San Diego

## PUBLICATIONS

L.G. Salem, J. Warchall, and P.P. Mercier, "A successive-approximation low drop-out voltage regulator," IEEE Journal of Solid-State Circuits (JSSC), invited paper to the special issue on ISSCC 2017, Jan. 2018, vol. 53, no. 1, pp. 35-49.
L.G. Salem, J.F. Buckwalter, and P.P. Mercier, "A recursive switched-capacitor house-of-cards power amplifier," IEEE Journal of Solid-State Circuits (JSSC), Jul. 2017, vol. 52, no. 7, pp. 1719-1738.
L.G. Salem, J.G. Louie, and P.P. Mercier, "Flying-domain DC-DC power conversion," IEEE Journal of Solid-State Circuits (JSSC), Dec. 2016, vol. 51, no. 12, pp. 2830-2842.
L.G. Salem and P.P. Mercier, "A recursive switched-capacitor DC-DC converter achieving $2^{N}-1$ ratios with high efficiency over a wide output voltage range," in IEEE Journal of Solid-State Circuits (JSSC), Dec. 2014, vol. 49, no. 12, pp. 2773-2787.
D.-G. Lee, L.G. Salem, and P.P. Mercier, "Narrowband Transmitters: ultra-low-power design," in IEEE Microwave Magazine, invited, vol. 16, no. 3, Apr. 2015, pp. 130-142.
L.G. Salem and P.P. Mercier, "A sub-1.55mV accuracy 36.9ps FOM digital low-dropout regulator employing switched-capacitor resistance," 2018 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, Feb. 2018.
L.G. Salem, J. Warchall, and P.P. Mercier, "A 100nA-2mA Successive-Approximation digital LDO with PD compensation and sub-LSB duty control achieving a 15.1 ns response-time at 0.5 V," 2017 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, Feb. 2017, pp. 340-341.
L.G. Salem and P.P. Mercier, "A 0.4-1V 1MHz-to-2GHz switched-capacitor adiabatic clock driver achieving $55.6 \%$ clock power reduction," 2017 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, Feb. 2017, pp. 442-443.
L.G. Salem, J.G. Louie, and P.P. Mercier, "A flying-domain DC-DC converter powering a Cortex-M0 processor with $90.8 \%$ efficiency," 2016 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, Feb. 2016, pp. 234-236.
L.G. Salem and P.P. Mercier, "An 85\%-efficiency fully-integrated 15-ratio recursive switched-capacitor DC-DC converter with 0.1-2.2V output voltage range," 2014 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, Feb. 2014, pp. 88-89.
L.G. Salem and P.P. Mercier, "A battery-connected 24-ratio switched capacitor PMIC achieving 95.5\%-efficiency," 2015 IEEE Symposium on VLSI Circuits, Jun. 2015, pp. C340-C341.
L.G. Salem and P.P. Mercier, "A single-inductor 7+7 ratio reconfigurable resonant switched-capacitor DC-DC converter with 0.1-to-1.5V output voltage range," in Proc. IEEE Custom Integrated Circuits Conference (CICC), Sep. 2015, pp. 1-4.
L.G. Salem and P.P. Mercier, "A 45-ratio recursively sliced series-parallel switched-capacitor DC-DC converter achieving 86\% efficiency," in Proc. IEEE Custom Integrated Circuits Conference (CICC), Sep. 2014, pp. 1-4.
L.G. Salem, J.F. Buckwalter, and P.P. Mercier, "A recursive house-of-cards digital power amplifier employing a $\lambda / 4$-less Doherty power combiner in 65 nm CMOS," in Proc. IEEE European Solid-State Circuits Conference (ESSCIRC), Sep. 2016, pp. 189-192.
L.G. Salem and P.P. Mercier, "A footprint-constrained efficiency roadmap for on-chip switched-capacitor DC-DC converters," in Proc. IEEE International Symposium on Circuits and Systems (ISCAS), May 2015, pp. 2321-2324.
L.G. Salem and Y.I. Ismail, "Switched-capacitor DC-DC converters with output inductive filter," in Proc. IEEE International Symposium on Circuits and Systems (ISCAS), May 2012, pp. 444-447.
L.G. Salem and Y.I. Ismail, "Slow-switching-limit loss removal in SC DC-DC converters using adiabatic charging," in 2011 International Conference on Energy Aware Computing, Dec 2011, pp. 1-4.
L.G. Salem and Y.I. Ismail, "All-Digital comparator using device-ratio triggering levels," in 2011 International Conference on Energy Aware Computing, Dec 2011, pp. 1-4.
L.G. Salem and Y.I. Ismail, "Fast hysteretic control of on-chip multi-phase switched-capacitor dc-dc converters," in Proc. IEEE International Symposium of Circuits and Systems (ISCAS), May 2011, pp. 2561-2564.
L.G. Salem and R. Jain, "A Novel control technique to eliminate output-voltage-ripple in switched-capacitor DC-DC converters," in Proc. IEEE International Symposium on Circuits and Systems (ISCAS), May 2011, pp. 825-828.
L.G. Salem and Y.I. Ismail, "Fully integrated fast response Switched-Capacitor DC-DC converter using reconfigurable interleaving," in 2010 International Conference on Energy Aware Computing, Dec. 2010, pp. 1-4.
L.G. Salem, R. Jain, M. Ghoneima, and Y.I. Ismail, "Parallel feedback compensation for LDO voltage regulators," in 2010 International Conference on Energy Aware Computing, Dec. 2010, pp. 1-4.
L.G. Salem and Y.I. Ismail, "Gain-band self-clocked comparator for DC-DC converters hysteretic control," in 2010 International Conference on Energy Aware Computing, Dec. 2010, pp. 1-4.

## ABSTRACT OF THE DISSERTATION

## Recursive Switched-Capacitor Circuit Topologies for Miniaturized Power Conversion

by

Loai G. Salem

Doctor of Philosophy in Electrical Engineering (Electronic Circuits and Systems)

University of California San Diego, 2018

Professor Patrick P. Mercier, Chair

Power conversion circuits are essential components in every electronic device, from laptops and cellphones to wearable and implantable devices. These circuits convert the DC voltage of the battery or energy source to the appropriate DC level or AC form required by a load. The key challenge of this class of circuits is that they often define the size and energy of an electronic system. This is because the volume and quality factor of the required inductors render power conversion circuits disproportionately large and lossy relative to the rest of the electronic system.

This work aims to make power conversion circuits smaller without losing power efficiency by developing new circuit topologies that primarily rely on capacitors instead of inductors. The developed switched-capacitor DC-to-DC and DC-to-AC converters rely on architecting the governing current and voltage equations such that they follow efficient mathematical formulae to enable the minimum loss and/or the smallest volume as compared to prior topologies. The developed topologies thus enable switched-capacitor converters to be as efficient as the industry's flagship inductor-based solutions, but at orders-of-magnitude smaller size. Moving forward, such an approach of relying on electric instead of magnetic energy transfer enables the exploitation of fabrication-technology feature-size scaling to shrink the total electronic system size beyond what is feasible through today's technologies.

## Chapter 1

## Introduction

### 1.1 What is a Power Converter?

A power converter is a circuit that converts the level and/or form (whether DC or AC) of an input voltage to a desired target voltage while keeping the output power equal (or close) to the input power. In mechanical terms, a power converter acts as a valve that changes a fluid velocity at its ends while preserving the fluid mass (i.e. without leaking the fluid). In this sense, a linear voltage regulator is not considered a power converter, because it steps down the input voltage by burning the energy of the excess voltage difference as heat in a series resistance. In the mechanical analogy, this is equivalent to reducing the output fluid flow rate by creating a cut in the carrying pipe to get rid of the excess flow.

Power converters are essential elements in almost every electronic device. As discussed, they can step down/up the level of the input voltage and thus can control the amount of electric power delivered to a resistive load. This power control capability is of critical importance in the present integrated circuit industry. Specifically, power management circuits control the flow of electric energy from the system battery or energy source to the supplied circuitry. This is similar to the role of the gas pedal in a vehicle, where connecting the intake of the employed gasoline engine directly to the fuel tank results in a huge waste of energy.

Additionally, power converters change the input electric power from a DC form into an AC form and vice versa. A power converter can step down/up the input DC or AC voltage to a target DC or AC voltage level at the output, respectively, in order to realize the required power control. This is essential to the functionality of the supplied load. For example, a 60 Hz power inverter converts the input DC electric power from a solar cell to an output power of alternating currents to supply home appliances. Furthermore, the amplitude of the resulting output voltage in many cases has to be precisely controlled to a target reference level. For instance, a power amplifier converts the input DC power from the supply to an output AC (or RF) power of an amplitude controlled in accordance with the input RF signal; otherwise, the transmitted power cannot be modulated according to the carried data.

### 1.2 Switched Inductor versus Switched Capacitor Power Conversion

Power converters employ switches and energy storage elements, i.e. inductors/transformers and capacitors, to perform the required conversion. Unlike resistors, inductors and capacitors can produce a voltage drop while storing the energy of the voltage difference between the input and output ports instead of dissipating it. Additionally, an energy storage element can produce a voltage step-up, which cannot be achieved using a resistor. Therefore, these elements enable power converters to step down/up the input voltage while achieving high efficiency. However, as a reactive element, an inductor (capacitor) requires a change in current (voltage) in order to shuttle energy. Since typically an input DC voltage source is employed or an output DC level is required, switches are essential elements in every power converter to allow the reactive elements to store energy and release it back to the load. Unfortunately, the non-idealities of the energy storage elements as well as the equivalent on-resistance of the switches introduce power losses and thereby degrade the output-to-input power ratio, or efficiency, below $100 \%$.

### 1.2.1 Challenges of Switched-Capacitor Power Conversion

Most power conversion circuits in the industry rely on inductors or transformers for energy shuttling. Unlike the switched-inductor (SL) DC-DC approach, switched-capacitor (SC) DC-DC converters can achieve high efficiency only at discrete input-to-output conversion ratios. For example, a 2:1 SC converter is almost $100 \%$ efficient in dividing the input battery voltage ($V_{BAT}$) by 2, while its efficiency rapidly falls at other voltages. This drawback has plagued the capacitive approach since the early 1970s and made it impractical to use in the \$40-billion power management market despite its inherent suitability for miniaturization in CMOS processes. It stands to reason that employing more ratios, through multiple SC networks, at equidistant conversion steps could provide a semi-flat efficiency profile versus voltage, just like the inductive approach. Unfortunately, using prior SC topologies, the efficiency of ratios of higher resolution than 2-bit falls significantly below the 2-bit linear-regulation efficiency, i.e. the efficiency obtained when using a lossy resistance in series with the 2-bit networks to provide $V_{out}$ between the fixed 4:1, 2:1, and 4:3 ratios. This is because the charge-sharing loss increases almost exponentially with the resolution ($N$) in prior SC topologies, e.g. the 3-bit ratios' loss is $7 \times$ the 2-bit loss. Therefore, the SC efficiency profile versus voltage traditionally has peaks at 3 discrete ratios with steep $30 \%$-deep valleys in between, like a saw-tooth, which is not suitable for portable consumer applications.
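The saw-tooth profile described above can be illustrated with a minimal sketch. It assumes an idealized converter whose only loss is the linear-regulation drop from the nearest available ratio down to the target voltage, so the efficiency is bounded by $V_{out} /(M V_{BAT})$ for the smallest available ratio $M$ at or above $V_{out} / V_{BAT}$; charge-sharing, conduction, and parasitic losses are deliberately ignored. The function name and the use of $2^{N}-1$ equidistant ratios $m / 2^{N}$ are illustrative assumptions.

```python
from fractions import Fraction


def ideal_efficiency(x, n_bits):
    """Upper bound on efficiency at x = Vout/Vbat using ratios m/2^n_bits.

    Assumes lossless SC networks, so the only loss is the linear drop
    from the nearest ratio at or above x down to x.
    """
    x = Fraction(x)
    if not 0 < x < 1:
        raise ValueError("x must lie strictly between 0 and 1")
    steps = 2 ** n_bits
    ratios = [Fraction(m, steps) for m in range(1, steps)]
    usable = [r for r in ratios if r >= x]
    if not usable:
        raise ValueError("x exceeds the largest available ratio")
    return x / min(usable)


# Hypothetical operating point: Vout = 0.55 * Vbat.
print(float(ideal_efficiency(Fraction(11, 20), 2)))  # 2-bit ratios only
print(float(ideal_efficiency(Fraction(11, 20), 4)))  # 4-bit ratios
```

Under these assumptions, a target just above the 1/2 ratio is served by the 3/4 ratio when only 2-bit ratios exist, whereas finer ratios place an efficient operating point much closer to the target.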

### 1.2.2 Advantages of Switched-Capacitor Power Conversion

Present power conversion often defines the size and energy of an electronic system, where the volume and quality factor of the required large inductors render power conversion circuits disproportionately large and lossy relative to the rest of the electronic system. In fact, power conversion circuits are becoming the bottleneck for the system size and energy in the integrated circuit industry. With the increasing number of components inside a System-on-Chip (SoC), the per-block power management required to leash the soaring levels of chip power drives the need for full integration of DC-to-DC and DC-to-AC conversion to avoid the increased cost and limited bandwidth of the supply control. Unfortunately, on-die inductors not only occupy a large area but also suffer from a large equivalent series resistance (ESR) and other losses, which prevents a fully-integrated switched-inductor (SL) DC-DC solution from reaching an acceptable level of efficiency.

On the other hand, switched-capacitor converters have the potential to make power conversion smaller in size. Capacitors can store the same amount of energy as an inductor but in approximately two orders of magnitude smaller volume. Theoretically, the inductance of a structure is proportional to the area that the magnetic flux crosses. On the other hand, the capacitance of a structure is inversely proportional to the separation between the conducting plates. Therefore, to a first order, increasing the capacitance results in a smaller volume while increasing the inductance results in a larger one.

### 1.3 Developments in This Work

The key challenge of switched-capacitor (SC) power converters is that their energy conversion is only efficient at a few discrete input-to-output voltage ratios unlike the continuous switched-inductor (SL) conversion. This work addresses this challenge by developing new DC-to-DC/RF circuit topologies that enable a large number of input-to-output conversion ratios at arbitrarily small steps to reproduce the SL continuous capability without sacrificing efficiency or size.

The first part of this work is concerned with DC-to-DC conversion and introduces new SC topologies that enable a practical replacement for the bulky SL DC-to-DC conversion. Chapter 2 introduces a recursive binary topology by reproducing the recursive formula in a switched-network form. Unlike prior SC topologies of divergent loss, the charge-sharing loss in the recursive topology follows a convergent geometric binary series with $N$ that settles at the absolute minimum loss value of $1.67 \times$ the 2-bit ratios' loss at arbitrarily high resolutions. Therefore, the topology achieves the highest efficiency among SC topologies in integrated applications. Via the recursive topology, the higher-resolution ratios can effectively fill the $30 \%$ efficiency gaps in between the 2-bit ratios, enabling a semi-continuous profile like the inductive approach and hence a CMOS-compatible replacement for the SL converters. Additionally, Chapter 3 develops a recursive flying-domain DC-DC converter that achieves voltage conversion not by switching a size-consuming passive element (inductor or capacitor), but rather by switching the load itself to achieve high power density and efficiency.

Chapter 4 is concerned with enabling capacitive DC-DC conversion for high power levels. For high power, it is more economical to implement the required capacitors as discrete elements using other technologies instead of CMOS or SOI. A baseline discrete capacitor has $25 \times$ smaller footprint, $5 \times$ smaller thickness, $21 \times$ lower cost, and more importantly, $700 \times$ lighter weight than an inductor for the same energy storage capability. It stands to reason that a competitive power management integrated circuit (PMIC) can be realized by using capacitors. However, the industry-standard SC charge-pump TPS6050x from Texas Instruments, and similar ICs from Analog Devices and Maxim Integrated among others, employ only three ratios, and thus their efficiency peaks at the three ratios and rapidly falls at other $V_{out}$ values like a saw-tooth. To replicate the inductive flat-efficiency profile, 31 ratios at equidistant steps of 5-bit resolution are needed. Realizing this large number of ratios with the industry-standard topology requires 31 capacitors, which occupy about $2 \times$ larger board space and incur a far higher assembly cost than the single inductor in the SL approach. In contrast, Chapter 4 shows that operating the switches of the various stages of an SC converter at binary-weighted frequencies, instead of the traditional approach of operating them all at the same frequency, enables the smallest number of capacitors to realize a given ratio. By operating the stages at different frequencies, new switched-network phases result which impose extra Kirchhoff's Voltage Law (KVL) constraints, and thereby some capacitors can be safely removed without compromising the unique solution of the KVL system, i.e. the switched network still reaches a valid steady-state output voltage. Through the proposed Frequency-Scaled Gear-Train topology, only $N$ capacitors are required for $N$-bit ratios. This enables a practical replacement for the traditional SL PMIC with $1.5 \times$ smaller footprint, $4 \times$ lower cost, and more importantly, $5 \times$ smaller thickness and $123 \times$ lighter weight.

The second part of this work is concerned with enabling a capacitive replacement for the contemporary inductive DC-to-AC/RF power conversion. In Chapter 5, a new SC power amplifier (PA) topology is developed that can produce a large output power at a high RF voltage amplitude while employing low-voltage transistors. Current PA research endeavors to realize the required high output power by increasing the output current through size-consuming inductive transformation networks. This is because the employed PA switches cannot tolerate high voltages (HVs) in advanced CMOS fabrication technologies. Unfortunately, producing large power levels at high currents results in quadratically increasing resistive losses. Instead, the new SC topology stacks entire PA unit cells on top of each other to block higher voltages than what a single transistor can tolerate. Then, by switching the supply and ground of a stack of $N-i-1$ PA cells through a stack of $N-i$ cells, for $i$ from 0 to $N-2$, a 1:$N$ DC-to-RF conversion ratio can be realized to produce a high-power RF signal using low-voltage transistors.

Chapter 6 introduces SC-based adiabatic clocking. Clock distribution in a modern multicore processor often consumes more than one-third of the total power budget. For more than two decades, IBM, among other companies, has pursued a higher market position by reducing the dominant clock power using resonant clocking schemes. Size-consuming inductors are employed to tune out the clock equivalent capacitance at a single target frequency. The POWER8 server processor from IBM employs 114 on-chip inductors to achieve $33 \%$ clock power reduction. Instead, Chapter 6 develops a DC-to-AC topology that reduces the clock power by a factor of $N$ across an arbitrarily wide range of clock frequencies and without using inductors. The topology sequentially charges and discharges the clock capacitance through an $N$-step waveform within less than $18 \%$ of the clock cycle time, which reduces clock power by $N \times$. The prototyped IC implements a 3-step clock driver and achieves $55.6 \%$ clock power reduction across 1 MHz to 2 GHz without using inductors, enabling up to $200 \times$ reduction in size.

The third part of this work introduces new all-digital low-dropout (LDO) voltage regulator circuit topologies. All-digital LDOs suffer from slow output-voltage response time and low accuracy, which have hindered their deployment in consumer electronics. To address this problem, Chapter 7 develops the first all-digital LDO that is faster than an analog LDO, under the same power consumption, via a new recursive LDO topology. Further, Chapter 8 contributes a new class of digital LDOs that replaces the traditional programmable switch-array, the main actuator in an LDO, with a switched-capacitor bilinear resistance to regulate the output voltage. This enables digital LDOs to be as accurate as their analog counterparts, achieving a realistic digital replacement for analog LDOs.

## Part I

## Miniaturizing DC-to-DC Power Conversion

## Chapter 2

## Recursive Switched-Capacitor DC-to-DC Converters

### 2.1 Introduction

Today's digital integrated circuits achieve a balance between performance and energy efficiency through dynamic voltage scaling (DVS) of individual processing cores in accordance with performance needs. As the number of voltage domains increases in today's systems-on-chip (SoCs), generation of each supply voltage must occur not only efficiently, but within a small area. While linear regulators are compact and achieve fast response times [2, 3], their efficiencies are determined by the ratio of output to input voltage, potentially limiting system-level energy-efficiency [4, 5, 6]. On the other hand, switched-inductor DC-DC converters can achieve high efficiencies, yet typically require large off-chip inductors [7] or increased packaging complexity [8, 9, 10, 11, 12], limiting their ability to power many independent voltage domains in a small volume. To simultaneously address the efficiency/size trade-off, fully-integrated switched-capacitor (SC) DC-DC converters utilize high-$Q$ capacitors available in typical CMOS processes to convert and regulate power in an energy- and area-efficient manner.

(a) Examples of Recursive SC topology.

(b) Recursive SC topology pseudo-code generator.

Figure 2.1: The Recursive switched-capacitor realization of the ratios $1 / 4,3 / 8$, and $5 / 16$, and topology pseudo-code. Each SC cell comprises two out-of-phase 2:1 SC for a well-posed SC network.

Unlike switched-inductor DC-DC converters, however, SC converters are only efficient at discrete ratios of input-to-output voltages, constricting efficient DVS operation to small supply voltage ranges. Increasing the number of reconfigurable ratios can solve this; however, doing so introduces two main challenges: capacitance utilization and relative sizing. In a fully-integrated SC converter, the achievable efficiency is limited by the amount of committed capacitance; disabling even a small fraction of such capacitance significantly lowers the efficiency. Additionally, ensuring optimal relative sizing among the constituent capacitors can improve efficiency considerably [21]. Unfortunately, the complexity of conventional topologies, including the number of necessary capacitors and reconfiguration switches, increases significantly with the number of ratios, making simultaneous $100 \%$ capacitance utilization and optimal relative sizing extremely challenging. Thus, most SC converter designs employ only a small number of ratios [22, 13, 17, 23, 24], often resulting in large efficiency drops in between the available ratios.

This chapter presents the first demonstration of an SC converter that is reconfigurable amongst $2^{N}-1$ ratios without disconnecting a single capacitor while ensuring optimal relative sizing and high efficiency across a large output voltage range. The proposed Recursive SC DC-DC converter topology (RSC) [25], shown in Fig. 2.1, recursively divides the delivered output charge across $N$ 2:1 cells connected in cascade to generate $N$-bit ratios. By maximizing the number of input voltage and ground connections, charge-sharing losses are minimized, and in fact become a convergent geometric series with minimal additional losses incurred beyond 4-bit ratios. Given the inherently modular nature of the converter, $100 \%$ capacitance utilization is ensured by reconfiguring cell connections either in cascade (for high resolutions) or parallel (for lower resolutions), with binary slicing of the largest cascaded cell in order to enable reconfiguration amongst odd and even resolutions, all while ensuring optimal relative sizing.

This chapter is organized as follows: Section II introduces the RSC topology and discusses its theoretical performance compared with prior topologies. Section III presents architectural implementation details of a 4-bit RSC converter, while Section IV presents detailed circuit design. Experimental results of the test chip that verify the predicted performance are provided in Section V.

### 2.2 Recursive Switched-Capacitor Topology

The most basic RSC building block is a 2:1 SC converter. As shown in Fig. 2.1a, a 2:1 SC can be considered as a three-port circuit that includes two input ports, $IN_{top}$ and $IN_{bottom}$, to receive high and low input voltages, respectively, and an output port MID that provides the average of the voltages at the input ports, i.e. $\left(IN_{top}+IN_{bottom}\right) / 2$. The 2:1 SC cell equally loads its output port current on the two input ports ($IN_{top}$, $IN_{bottom}$). The following subsections discuss how 2:1 SC building cells can be connected to realize $2^{N}-1$ conversion ratios while minimizing losses.

### 2.2.1 Topology Definition and Steady-State Loss Analysis

Figure 2.1b shows the Recursive SC topology pseudo-code. Starting with a single 2:1 SC that divides the converter input voltage, $V_{in}$, into two intervals (0-to-$V_{in}/2$ and $V_{in}/2$-to-$V_{in}$), the topology inserts a 2:1 SC cell in series between the previous cell output MID and the converter ground 0, or stacked between $V_{in}$ and the output MID of the previous 2:1 cell, repeatedly, until the desired binary conversion ratio $m / 2^{N}$ is realized, where $m<2^{N}$. Figure 2.1a demonstrates examples of the ratios $1/4$, $3/8$, and $5/16$ at 2-, 3-, and 4-bit resolutions, respectively.
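The construction described in this paragraph can be sketched in a few lines. The pseudo-code of Fig. 2.1b is not reproduced in this text, so the following is only an illustration derived from the prose: the first 2:1 cell sits between $V_{in}$ and ground, and every later cell averages the previous MID with either $V_{in}$ or ground. The function names and the bit-to-rail encoding are assumptions, and only odd $m$ is handled because an even $m$ reduces to a lower resolution.

```python
from fractions import Fraction


def rsc_rails(m, n_bits):
    """Rail choice for stages 2..N: 1 ties the free port to Vin, 0 to ground.

    Assumed encoding: stage k uses bit (k - 1) of the odd numerator m.
    """
    if not 0 < m < 2 ** n_bits or m % 2 == 0:
        raise ValueError("m must be odd and satisfy 0 < m < 2**n_bits")
    return [(m >> (k - 1)) & 1 for k in range(2, n_bits + 1)]


def rsc_ratio(rails):
    """Ideal Vout/Vin obtained by cascading 2:1 cells with the given rails."""
    mid = Fraction(1, 2)  # first cell averages Vin (1) and ground (0)
    for rail in rails:
        mid = (mid + rail) / 2  # each cell averages previous MID and a rail
    return mid


# The three ratios quoted for Fig. 2.1a.
for m, n in [(1, 2), (3, 3), (5, 4)]:
    assert rsc_ratio(rsc_rails(m, n)) == Fraction(m, 2 ** n)
```

In this encoding every stage after the first has one port on a supply rail, which is the property the topology relies on to limit inter-stage loading.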

The proposed topology minimizes cascaded losses by maximizing the number of input voltage, $V_{in}$, and ground, 0, connections. Specifically, each 2:1 stage $C i$ has at least one input port connected to either the input voltage $V_{in}$ or the converter ground 0, and thus each stage loads half of its output charge $q_{i}$ on the input supply, $V_{in}$, or ground, 0, instead of loading such charge on a previous cascaded stage. For example, Fig. 2.2 illustrates two different configurations that both realize an $11 / 16$ ratio. In Fig. 2.2a, the last stage, $C 4$, loads half of the output charge $q_{out}$ on the $2^{\text{nd}}$ stage, $C 2$, which in turn loads the first stage, $C 1$, with $3 q_{out} / 8$. The third stage, $C 3$, loads the first stage by an additional $q_{out} / 4$, and thus the total charge delivered by the first stage is $5 q_{out} / 8$. In contrast, the RSC converter employs the configuration shown in Fig. 2.2b, where the $IN_{top}$ of $C 4$ and the $IN_{bottom}$ of $C 3$ are directly connected to the converter input, $V_{in}$, and the ground, 0, respectively, and therefore the charges loaded on $C 2$ and $C 1$ are both reduced by $q_{out} / 2$. For an arbitrary recursion depth $N$, each stage is loaded with a charge $q_{i}$ that is a binary-weighted fraction of the total output charge, $q_{out}$, such that $q_{i}=q_{out} / 2^{N-i}$, where $i$ is the stage order in the cascade.

(a) Non-optimal cascading connection. (b) RSC optimal connection, minimal inter-stage loading.

Figure 2.2: Charge flow through two inter-cell connections to realize the same ratio $11 / 16$ (a) non-optimal cascading (b) proposed RSC connection. Bold blocks are loaded with extra charge than the corresponding blocks in (b) with RSC connection. Bold arrows represent the extra loading charge.

It is known that the intrinsic loss mechanisms in an SC converter can be modeled by a finite output resistance, $R_{out}$, in either the slow- or fast-switching limit (SSL or FSL, respectively): $R_{SSL}$, where the charge-sharing loss dominates, and $R_{FSL}$, where the switches' on-resistance dominates the losses [21, 26]. In the SSL, the total energy loss through the converter can be found by adding the charge-sharing loss across each capacitor $C_{i}$, $\left(q_{i} / 2\right)^{2} / C_{i}$, and by normalizing the charge-sharing power loss by the squared output current $I_{L}^{2}$, i.e. $\left(q_{out} f_{sw}\right)^{2}$, the equivalent output resistance $R_{SSL}$ can be calculated as:

$$
\begin{equation*}
R_{S S L}=\sum_{i=1}^{N}\left(\frac{1}{2^{N-i+1}}\right)^{2} \frac{1}{f_{s w} C_{i}}, \tag{2.1}
\end{equation*}
$$

where $C_{i}$ is the total capacitance of the two flying capacitors per stage. The derived $R_{SSL}$ is for a symmetric RSC, where each cell consists of two oppositely-phased 2:1 SCs, which eliminates any charge-balance DC capacitor between the cascaded stages. Similarly, at the FSL, the current through each switch becomes the current delivered by that stage, which is a binary-weighted fraction of the load current $I_{L}$ (i.e., $I_{L} / 2^{N-i}$). Thus, the equivalent output resistance $R_{FSL}$, for a $50 \%$ duty-cycle converter clock, is:

$$
\begin{equation*}
R_{F S L}=\sum_{i=1}^{N} \sum_{j=1}^{4} \frac{1}{2}\left(\frac{1}{2^{N-i}}\right)^{2} R_{i, j} \tag{2.2}
\end{equation*}
$$

where the summation over $j$ accounts for the four switches per stage $i$, and each switch resistance $R_{i, j}$ results from two parallel switches in a symmetric RSC of eight switches. The total equivalent output resistance $R_{\text {out }}$ at a given switching frequency, $f_{\text {sw }}$, occurring between the two asymptotes can be approximated by the Euclidean norm of the two limits, $R_{S S L}$ and $R_{F S L}$ [21]. From Eqns. 2.1 and 2.2, the RSC equivalent output resistance $R_{\text {out }}$ only depends on $N$ and does not change across the resolution ratios.
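As a rough numerical illustration of Eqns. 2.1 and 2.2 and of the Euclidean-norm approximation, the sketch below evaluates both asymptotes for user-supplied component values. The function names and the list-based data layout are assumptions made for this example; no component values from the test chip are implied.

```python
import math


def r_ssl(caps, f_sw):
    """Eq. 2.1: caps[i-1] is the total flying capacitance C_i of stage i."""
    n = len(caps)
    return sum((1 / 2 ** (n - i + 1)) ** 2 / (f_sw * c)
               for i, c in enumerate(caps, start=1))


def r_fsl(switch_res):
    """Eq. 2.2: switch_res[i-1] lists the four resistances R_{i,j} of stage i."""
    n = len(switch_res)
    return sum(0.5 * (1 / 2 ** (n - i)) ** 2 * r
               for i, stage in enumerate(switch_res, start=1)
               for r in stage)


def r_out(caps, switch_res, f_sw):
    """Euclidean norm of the SSL and FSL asymptotes."""
    return math.hypot(r_ssl(caps, f_sw), r_fsl(switch_res))


# Hypothetical single 2:1 stage: 1 F, 1 Hz, four 1-ohm switches.
print(r_out([1.0], [[1.0] * 4], 1.0))
```

Note that neither function takes the numerator $m$ as an argument, which reflects the statement that $R_{out}$ depends on $N$ but not on the particular ratio.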

Allocating a larger capacitance, $C_{i}$, for each stage results in a lower voltage swing, $\Delta V_{i}$, and lower charge-sharing loss, as dictated by Eqn. 2.1. Given the limited available capacitance in a fully-integrated SC converter, it is important to find the relative sizing of each stage capacitance $C_{i}$ from the total available on-die capacitance $C_{tot}$ to realize the minimal $R_{SSL}$. For fully-integrated capacitors with a single voltage rating and with no stacking of switches to block higher voltages, the optimal capacitance and conductance relative sizing match the relative charge transferred through each capacitor or switch [21, 27, 28], and hence are binary-weighted fractions of the total available capacitance $C_{tot}$ and conductance $G_{tot}$:

$$
\begin{align*}
C_{i} & =\left(\frac{2^{i-1}}{2^{N}-1}\right) C_{t o t}  \tag{2.3}\\
G_{i} & =\frac{1}{4}\left(\frac{2^{i-1}}{2^{N}-1}\right) G_{t o t} . \tag{2.4}
\end{align*}
$$

With such optimal sizing, the equivalent output impedance at the two asymptotes can be found as:

$$
\begin{gather*}
R_{S S L}^{*}=\frac{1}{f_{s w} C_{t o t}}\left(1-\frac{1}{2^{N}}\right)^{2}  \tag{2.5}\\
R_{F S L}^{*}=\frac{2}{G_{t o t}}\left(1-\frac{1}{2^{N}}\right)^{2} \tag{2.6}
\end{gather*}
$$
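The SSL result can be checked numerically. The sketch below substitutes the binary-weighted sizing of Eqn. 2.3 into Eqn. 2.1 and compares the sum against the closed form of Eqn. 2.5 using exact rational arithmetic; the function names are illustrative, and only the SSL case is checked.

```python
from fractions import Fraction


def optimal_caps(n_bits, c_tot=Fraction(1)):
    """Eq. 2.3: binary-weighted split of the total capacitance."""
    return [Fraction(2 ** (i - 1), 2 ** n_bits - 1) * c_tot
            for i in range(1, n_bits + 1)]


def r_ssl(caps, f_sw=Fraction(1)):
    """Eq. 2.1 evaluated for the given per-stage capacitances."""
    n = len(caps)
    return sum(Fraction(1, 2 ** (n - i + 1)) ** 2 / (f_sw * c)
               for i, c in enumerate(caps, start=1))


def r_ssl_closed_form(n_bits, c_tot=Fraction(1), f_sw=Fraction(1)):
    """Eq. 2.5."""
    return (1 - Fraction(1, 2 ** n_bits)) ** 2 / (f_sw * c_tot)


# The sum and the closed form agree exactly for every resolution tried.
for n in range(1, 9):
    assert r_ssl(optimal_caps(n)) == r_ssl_closed_form(n)
```

With $C_{tot}=1$ F and $f_{sw}=1$ Hz, the value grows from $1/4$ at $N=1$ towards $1$ as $N$ increases, i.e. it stays bounded by $1 /\left(f_{sw} C_{tot}\right)$.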

To realize the highest possible efficiency for a given silicon area, it is desired to select the SC topology that incurs the lowest charge-sharing loss, $R_{SSL}$, to deliver the same $q_{out}$ and conversion ratio. The power-available from an SC converter normalized by the power-available from a 2:1 SC, using the same silicon area, can be used as a metric to compare various SC topologies in the SSL and FSL. After assigning the capacitors appropriate optimal relative sizing, the SSL normalized power-available from a topology at a conversion ratio $m / n$ becomes $M_{SSL}=\left(\frac{m / n}{\sum_{i} a_{c, i}}\right)^{2}$, where $m<n$ and $a_{c, i}$ is the fraction of the output charge $q_{out}$ that flows through the capacitor $C_{i}$.
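A minimal sketch of the metric follows. The general function implements the expression for $M_{SSL}$ as stated; the RSC charge multipliers are an assumption read off the coefficients of Eqn. 2.1 (one multiplier $1 / 2^{N-i+1}$ per stage), and the function names are illustrative.

```python
from fractions import Fraction


def m_ssl(ratio, charge_multipliers):
    """M_SSL = (ratio / sum of a_{c,i})^2 for optimally sized capacitors."""
    return (Fraction(ratio) / sum(charge_multipliers)) ** 2


def rsc_charge_multipliers(n_bits):
    """Assumed per-stage multipliers 1/2^(N-i+1), taken from Eq. 2.1."""
    return [Fraction(1, 2 ** (n_bits - i + 1)) for i in range(1, n_bits + 1)]


# A single 2:1 cell is the normalization reference, so its metric is 1.
assert m_ssl(Fraction(1, 2), rsc_charge_multipliers(1)) == 1

# Under the assumption above, 4-bit RSC ratios give (m / 15)^2.
print(m_ssl(Fraction(5, 16), rsc_charge_multipliers(4)))
```

Because the assumed RSC multipliers sum to $1-2^{-N}$ regardless of $m$, the metric in this sketch varies with the ratio only through the numerator.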

Figure 2.3 compares five conventional SC topologies [29, 30, 31], as well as a Successive Approximation (SAR) SC converter [32] and the proposed RSC converter, using the established SSL metric, $M_{SSL}$, where the capacitors of each topology are assigned the optimal relative sizing. The charge multiplier vectors of the various topologies can be found in [33] and through the analysis in [21]. The topologies are compared up to 5-bit binary conversion ratios. The SP topology $M_{SSL}$ is also shown at the ratios $1/6$, $1/5$, $2/7$, $1/3$, $2/5$, and $3/7$, while the Fibonacci topology $M_{SSL}$ is shown for the Fibonacci series ratios $1/21, 1/13, \ldots, 1/2$. All topologies, with the exception of the Ladder topology, have the same $M_{SSL}$ at the minimum and maximum conversion ratios within each resolution, e.g. $1/2$, $1/4$, $1/8$, $1/16$, $1/32$ and $3/4$, $7/8$, $15/16$, $31/32$, respectively. As shown in Fig. 2.3b, due to the binary division of the output charge across the various stages, the RSC cascading loss converges to an upper limit, $1 /\left(f_{sw} C_{tot}\right)$, at large resolutions $N$, without further $M_{SSL}$ degradation. The other topologies exhibit an $M_{SSL}$ eye opening with higher resolutions $N$ for ratios $m_{odd} / 2^{N}$, where the SSL loss becomes the summation of a divergent series.

(a) Power-available metric for the seven topologies.

(b) Power-available metric for SP, symmetric RSC, and symmetric SAR topologies.

Figure 2.3: The SSL power-available metric, $M_{S S L}$, for the seven topologies at binary ratios up to 5-bit resolution. The topology of the highest power-available at certain ratio incurs the lowest charge-sharing loss for a given silicon area.

Figure 2.4 shows the $R_{S S L}$, using a 1 F total capacitance and at 1 Hz switching frequency for the SP and the RSC topologies, with capacitors of optimal relative-sizing, across binary ratios up to 5-bit resolutions. The RSC normalized $R_{S S L}^{*}$ saturates at an upper limit of $4 \times R_{S S L}$ of a 1/2 ratio. Figure 2.5 shows the FSL optimal-voltage metric [21] for the seven topologies at the same binary ratios as previously discussed. In general for fully-integrated converters, capacitors consume most of the die area, and thus topologies that achieve the lowest SSL loss for a given silicon area (i.e., topologies with the highest $M_{S S L}$ ) are desired.
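The saturation of the RSC $R_{S S L}^{*}$ can be reproduced with a short calculation. The sketch below assumes that cell $i$ supplies the binary-weighted share $q_{\text {out }} / 2^{N-i}$ of the output charge, that each $2: 1$ cell's flying capacitor carries half of its cell's output charge, and that the optimally sized SSL resistance is $\left(\sum_{i} a_{c, i}\right)^{2} /\left(C_{t o t} f_{s w}\right)$. These are assumptions drawn from the general SSL analysis, and the function names are hypothetical.

```python
def rsc_charge_sum(n_bits):
    """Sum of charge multipliers for an N-bit RSC (assumed model).

    Cell i (1..N) is assumed to supply q_out / 2^(N-i), and its flying
    capacitor is assumed to carry half of that charge.
    """
    return sum(0.5 / 2 ** (n_bits - i) for i in range(1, n_bits + 1))


def rsc_r_ssl(n_bits, c_tot=1.0, f_sw=1.0):
    """Optimally sized SSL resistance in ohms for C_tot in F and f_sw in Hz."""
    return rsc_charge_sum(n_bits) ** 2 / (c_tot * f_sw)


# 1 F and 1 Hz, matching the normalization used for Fig. 2.4.
for n_bits in range(1, 6):
    print(n_bits, rsc_r_ssl(n_bits), rsc_r_ssl(n_bits) / rsc_r_ssl(1))
```

Under these assumptions the 1-bit value is $1 / 4$, and the ratio to the 1-bit value approaches 4 as $N$ grows, in line with the stated upper limit of $1 /\left(f_{s w} C_{t o t}\right)$.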


Figure 2.4: The $R_{S S L}^{*}$ for the SP and symmetric RSC versus the binary ratios using a 1 F total capacitance and for a SC converter operated at 1 Hz .

### 2.2.2 Open-Loop Power Stage Optimization

After defining the optimal relative sizing of individual RSC components, it is critical to select the total switch area $A_{s w}$ and switching frequency $f_{s w}$ that result in the maximum efficiency for a given load $I_{L}$ and input voltage $V_{i n}$. In a fully-integrated SC, the charge-sharing SSL loss constitutes the major loss component. To decrease the SSL loss, either the available capacitance or the switching frequency, and hence switching parasitics, should be increased. In integrated converters, capacitance is not typically considered as a variable in the optimization process, and the maximum available capacitance for a given silicon area is implemented. The maximum efficiency over the design space $\left(A_{s w}, f_{s w}\right)$ can be found by minimizing the total losses arising from the intrinsic SC $R_{\text {out }}$, and the switching losses that result from the power switches' gate drive as well as the capacitor bottom-plate losses. The drain parasitics of the switches are treated as part of the capacitors' bottom-plate parasitics.

Figure 2.5: The FSL performance metric $M_{F S L}$ of the seven topologies at binary conversion ratios up to 5-bit resolution.

Since a RSC consists of individual 2:1 SC cells that provide binary-weighted currents $I_{i}=I_{L} / 2^{N-i}$, it can be shown that the optimal switching frequency $f_{s w}^{*}$ and total conductance $G_{t o t}^{*}$ are given by ${ }^{1}$ :

$$
\begin{equation*}
f_{s w}^{*}=\frac{1}{4 \sqrt[3]{2}} \sqrt[3]{\frac{G_{\text {on }}}{C_{\text {gate }} V_{\text {gate }}^{2}}\left(\frac{I_{L}}{C_{\text {tot }}}\right)^{2}} \cdot \sqrt[3]{\left(\frac{2^{N}-1}{2^{N-1}}\right)^{2}} \tag{2.7}
\end{equation*}
$$

$$
\begin{equation*}
\frac{G_{t o t}^{*}}{C_{t o t}}=4 \sqrt[3]{4} \sqrt[3]{\frac{G_{\text {on }}}{C_{\text {gate }} V_{\text {gate }}^{2}}\left(\frac{I_{L}}{C_{\text {tot }}}\right)^{2}} \cdot \sqrt[3]{\left(\frac{2^{N}-1}{2^{N-1}}\right)^{2}} \tag{2.8}
\end{equation*}
$$

where $G_{o n}$ and $C_{g a t e}$ are the switch conductance density in S/m and the switch gate capacitance per unit width in F/m, respectively. $V_{g a t e}$ is the gate drive voltage, and $G_{t o t}^{*} / C_{\text {tot }}$ is the optimal total conductance per unit capacitance. Essentially, $G_{\text {tot }}^{*} / C_{\text {tot }}$ sets the intersection point of the SSL and FSL loss components, or the SC corner frequency. The first term in Eqns. 2.7 and 2.8 depends on the technology conductance per gate drive energy loss, and the load current density per unit capacitance. The second term depends on the resolution $N$, where at 1-bit resolution the optimal values correspond to a $2: 1 \mathrm{SC}$ converter. On the other hand, with a larger number of cascaded stages $N$, the optimal $f_{s w}$ and total conductance density reach an upper limit of approximately $60 \%$ above the optimal values of a $2: 1 \mathrm{SC}$ converter utilizing the available $C_{t o t}$. Essentially, the allocated capacitance of the last stage at large $N$ becomes $C_{\text {tot }} / 2$ while supplying $I_{L}$ load current, and thus the optimal design point shifts by $\sqrt[3]{4}$. From Eqn. 2.8, the optimal total switch area does not change from one ratio to another within a given resolution $N$, simplifying the implementation of a reconfigurable SC. However, a small change in the optimal total conductance results when the bottom-plate parasitics are significant, and using an average total switch width across the various ratios only slightly affects the optimal efficiency. The optimal total loss per unit ampere becomes:
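To make the two terms of Eqns. 2.7 and 2.8 concrete, the sketch below evaluates them numerically. The function names are hypothetical. The 2 mA load, 3 nF capacitance, and 2.5 V gate drive echo values reported later in the chapter, whereas the `g_on` and `c_gate` numbers are placeholders rather than technology data.

```python
def resolution_factor(n_bits):
    """Second term of Eqns. 2.7 and 2.8: cbrt(((2^N - 1) / 2^(N-1))^2)."""
    return ((2 ** n_bits - 1) / 2 ** (n_bits - 1)) ** (2 / 3)


def technology_term(g_on, c_gate, v_gate, i_load, c_tot):
    """First term: cbrt(G_on / (C_gate * V_gate^2) * (I_L / C_tot)^2), in 1/s."""
    return (g_on / (c_gate * v_gate ** 2) * (i_load / c_tot) ** 2) ** (1 / 3)


def optimal_f_sw(g_on, c_gate, v_gate, i_load, c_tot, n_bits):
    """Eqn. 2.7: optimal switching frequency in Hz."""
    tech = technology_term(g_on, c_gate, v_gate, i_load, c_tot)
    return tech * resolution_factor(n_bits) / (4 * 2 ** (1 / 3))


def optimal_g_per_c(g_on, c_gate, v_gate, i_load, c_tot, n_bits):
    """Eqn. 2.8: optimal total conductance per unit capacitance in S/F."""
    tech = technology_term(g_on, c_gate, v_gate, i_load, c_tot)
    return 4 * 4 ** (1 / 3) * tech * resolution_factor(n_bits)


# g_on and c_gate are hypothetical placeholder values.
params = dict(g_on=1.0e3, c_gate=1.0e-9, v_gate=2.5, i_load=2.0e-3, c_tot=3.0e-9)
f_opt = optimal_f_sw(n_bits=4, **params)
g_opt = optimal_g_per_c(n_bits=4, **params)
```

The resolution factor equals 1 at $N=1$ and tends to $\sqrt[3]{4} \approx 1.59$ for large $N$, which is the roughly $60 \%$ shift noted above. Dividing Eqn. 2.8 by Eqn. 2.7 also shows that $G_{t o t}^{*} / C_{t o t}=32 f_{s w}^{*}$ regardless of the parameter values.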

$$
\begin{equation*}
\frac{P_{\text {loss }}^{*}}{I_{L}}=3 \sqrt[3]{2} \sqrt[3]{\frac{I_{L}}{C_{\text {tot }}} / \frac{G_{\text {on }}}{C_{\text {gate }} V_{\text {gate }}^{2}}} \cdot \sqrt[3]{\left(\frac{2^{N}-1}{2^{N-1}}\right)^{4}} \tag{2.9}
\end{equation*}
$$

The minimum loss at the optimal design point depends on the ratio of the current density $I_{L} / C_{\text {tot }}$ to the switch conductance per gate loss, and the required resolution $N$. However, the efficiency $\left(1+P_{\text {loss }} /\left(I_{L} V_{\text {out }}\right)\right)^{-1}$ depends on the desired ratio and increases with larger output voltages $V_{\text {out }}$. For arbitrarily large resolutions $N$, the loss per ampere in Eqn. 2.9 saturates at about $2.5 \times$ the loss of a $2: 1 \mathrm{SC}$ that utilizes the same available $C_{\text {tot }}$.
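A short numerical sketch of Eqn. 2.9 and the efficiency expression follows. The function names are hypothetical, and the `g_on` and `c_gate` values are placeholders chosen only to exercise the formulas.

```python
def optimal_loss_per_amp(g_on, c_gate, v_gate, i_load, c_tot, n_bits):
    """Eqn. 2.9: minimum total loss per ampere of load current, in volts."""
    tech = ((i_load / c_tot) / (g_on / (c_gate * v_gate ** 2))) ** (1 / 3)
    res = ((2 ** n_bits - 1) / 2 ** (n_bits - 1)) ** (4 / 3)
    return 3 * 2 ** (1 / 3) * tech * res


def efficiency(loss_per_amp, v_out):
    """Efficiency 1 / (1 + P_loss / (I_L * V_out)) given P_loss / I_L."""
    return 1.0 / (1.0 + loss_per_amp / v_out)


# g_on and c_gate are hypothetical placeholder values.
params = dict(g_on=1.0e3, c_gate=1.0e-9, v_gate=2.5, i_load=2.0e-3, c_tot=3.0e-9)
loss_1bit = optimal_loss_per_amp(n_bits=1, **params)
loss_large_n = optimal_loss_per_amp(n_bits=40, **params)
saturation = loss_large_n / loss_1bit  # tends to 2^(4/3)
```

The ratio `saturation` tends to $2^{4 / 3} \approx 2.52$, matching the stated limit of about $2.5 \times$, and `efficiency` rises with `v_out` for a fixed loss per ampere.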

### 2.3 Recursive Resolution-Reconfiguration Architecture

In order to achieve the highest possible efficiency for a given silicon area, the various ratios must be realized while ensuring $100 \%$ utilization of the available on-die capacitance. Additionally, the optimal relative sizing of the constituent capacitors and switches should be guaranteed. Unlike conventional topologies, the proposed RSC topology inherently enables recursive inter-cell connection and recursive binary slicing that can simultaneously achieve both conditions with low-complexity.

### 2.3.1 Recursive Inter-Cell Connection

The proposed recursive inter-cell connection brings individual cells in parallel instead of disabling them when realizing lower-resolution ratios. Figure 2.6 summarizes the challenge of lowering the resolution in a 4-bit RSC. The converter consists of four $2: 1 \mathrm{SC}$ cells connected in succession, $C 1, C 2, C 3, C 4$, to realize $m_{\text {odd }} / 2^{4}$ ratios. As shown, the cells are allocated optimal binary sizing of the total available capacitance, $C_{t o t}$, and conductance, $G_{t o t}$. One method to realize a $1 / 2$ ratio from the 4-bit RSC is to route the output from the first stage using an output selection multiplexer and disable all other stages. While this will produce the correct output voltage, such an approach wastes the available capacitance in the last three cells $C 2, C 3$, and $C 4$, resulting in a $14 / 15$ $(93.33 \%)$ reduction in the available capacitance for charge transfer, thereby incurring a $15 \times$ penalty in $R_{S S L}$.

On the other hand, the Recursive implementation connects the four $2: 1 \mathrm{SC}$ cells in parallel when a $1 / 2$ ratio is desired, as shown in Fig. 2.6, which results in $100 \%$ capacitance usage and the minimum possible $\Delta V$ for a given output charge and silicon area. Similarly, to lower the resolution from 4-bit to 2-bit, the cascade of the last two cells $C 3$ and $C 4$ is brought in parallel to the cascade of the first two cells $C 1$ and $C 2$, as shown in Fig. 2.6, ensuring optimal relative sizing, i.e. $\frac{1}{3}: \frac{2}{3}$, and $100 \%$ capacitance usage.
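The capacitance bookkeeping in the two paragraphs above can be checked with exact fractions. In this sketch the helper name `binary_cell_sizes` is hypothetical, and the $15 \times$ penalty assumes that the $R_{S S L}$ of a single $2: 1$ cell is inversely proportional to its flying capacitance.

```python
from fractions import Fraction


def binary_cell_sizes(n_bits):
    """Binary-weighted shares of C_tot for cells C1..CN (they sum to 1)."""
    total = 2 ** n_bits - 1
    return [Fraction(2 ** i, total) for i in range(n_bits)]


c1, c2, c3, c4 = binary_cell_sizes(4)  # 1/15, 2/15, 4/15, 8/15

# Output-multiplexer approach for the 1/2 ratio: only C1 transfers charge.
wasted = 1 - c1               # share of C_tot left idle
r_ssl_penalty = 1 / c1        # assumes R_SSL scales as 1 / capacitance

# Recursive connection for 2-bit ratios: cascade (C1, C2) in parallel with (C3, C4).
stage_1 = c1 + c3             # first stage of the combined 2-bit cascade
stage_2 = c2 + c4             # second stage of the combined 2-bit cascade
```

The results are `wasted == 14/15`, `r_ssl_penalty == 15`, and a stage split of `1/3 : 2/3`, which is the optimal binary sizing for a 2-bit cascade with all of $C_{t o t}$ in use.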


Figure 2.6: Resolution reduction from 4-bit to 1-bit and 2-bit, using output selection multiplexer (left) and recursive inter-cell connection (right). The dashed cells are disabled when realizing lower resolutions.

### 2.3.2 Recursive Cell Slicing

Recursive slicing breaks down the largest cell in a cascade into binary-weighted sub-cells to enable even-to-odd, and odd-to-odd, resolution reconfiguration, all while satisfying optimal sizing. For example, instead of disabling the fourth cell $C 4$ to realize a 3-bit resolution in a 4-bit SC converter, which wastes more than half of the total capacitance, one or more of the four available cells is sliced to realize six cells in total, and then the resulting cells are arranged in two parallel cascades of three cells each. In general terms, it can be shown that recursively slicing the last cell in the cascade $\mathrm{C} N$ into $(N-1)$ binary-weighted cells results in the optimal solution. Such slicing achieves the optimal relative sizing when lowering the resolution, with a minimum number of sliced sub-cells and thus complexity. The resulting binary-sliced sub-cells are connected in cascade, while operating in parallel with the cascade of the original $(N-1)$ stages. For example, in the 4-bit converter shown in Fig. 2.7, the fourth cell $C 4$ is sliced into three sub-cells of binary weights ($1 / 7,2 / 7,4 / 7$), and arranged in parallel to the original cascade of the stages $C 1, C 2, C 3$ to achieve $m_{\text {odd }} / 8$ ratios.

Figure 2.7: Resolution reduction from 4-bit to 3-bit and from 3-bit to 2-bit, using output selection multiplexer (left) and recursive slicing with recursive inter-cell connection (right).

Similarly, when lowering the resolution further from three bits to two bits for $m_{\text {odd }} / 4$ ratios, the last cells $C 3$ and $C 4_{3}$, which in parallel represent the last stage in the 3-bit cascade, are each binary sliced into two sub-cells, $\left(C 3_{1}, C 3_{2}\right)$ and $\left(C 4_{31}, C 4_{32}\right)$, respectively. Figure 2.7 shows the resulting eight-cell sizing and connections of the topology implemented in this chapter. The relative sizing should be as close as possible to the illustrated weighting to achieve the peak performance; however, the optimal efficiency is not critically sensitive to mismatches between the various charge-transfer capacitors. It should be noted that only four cells are technically needed in order to realize all resolutions up to 4 bits; however, in order to guarantee $100 \%$ total capacitance utilization among all the possible resolutions while achieving optimal relative sizing, eight cells in total are instead employed.
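The eight-cell sizing can be verified against the optimal binary weighting at every resolution. The sketch below is one reading of the slicing described above: the grouping of cells into stages for each resolution is an interpretation of the text and Fig. 2.7, and the dictionary names are hypothetical.

```python
from fractions import Fraction as F

c3, c4 = F(4, 15), F(8, 15)       # 4-bit binary shares of C_tot
c4_3 = c4 * F(4, 7)               # largest slice of C4

cells = {
    "C1": F(1, 15), "C2": F(2, 15),
    "C3_1": c3 * F(1, 3), "C3_2": c3 * F(2, 3),
    "C4_1": c4 * F(1, 7), "C4_2": c4 * F(2, 7),
    "C4_31": c4_3 * F(1, 3), "C4_32": c4_3 * F(2, 3),
}

# Cells operated in parallel within each stage (assumed grouping).
stages = {
    4: [["C1"], ["C2"], ["C3_1", "C3_2"], ["C4_1", "C4_2", "C4_31", "C4_32"]],
    3: [["C1", "C4_1"], ["C2", "C4_2"], ["C3_1", "C3_2", "C4_31", "C4_32"]],
    2: [["C1", "C4_1", "C3_1", "C4_31"], ["C2", "C4_2", "C3_2", "C4_32"]],
    1: [list(cells)],
}


def stage_sizes(n_bits):
    """Share of C_tot held by each stage at the given resolution."""
    return [sum(cells[name] for name in group) for group in stages[n_bits]]


for n_bits in (4, 3, 2, 1):
    print(n_bits, stage_sizes(n_bits))
```

Under this grouping each resolution uses all of $C_{t o t}$, and the stage shares follow $2^{i-1} /\left(2^{N}-1\right)$, for example $1 / 7, 2 / 7, 4 / 7$ at 3 bits and $1 / 3, 2 / 3$ at 2 bits.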

### 2.3.3 Inter-Cell Reconfiguration Switches

This section discusses the implementation details to generate the desired ratios with a minimum set of programming switches, and hence minimum added parasitics. The required inter-cell reconfiguration switches can be divided into two main categories: switches to implement ratio-programming within a specific recursion depth $N$, and switches for resolution reconfiguration.

## Ratio-Reconfiguration Switches

Figure 2.8 illustrates a simplified schematic of two $2: 1 \mathrm{SC}$ cells connected in parallel. By operating the four switches in each $2: 1$ cell from the non-overlapped clock phases, $\Phi_{1}$ and $\Phi_{2}$, the $1 / 2$ ratio is realized. In order to realize $1 / 4$ and $3 / 4$ conversion ratios in a 2-bit RSC, the two cells in Fig. 2.8 are either connected in cascade or in stack through the four added reconfiguration switches $r_{1}, r_{2}, r_{3}$, and $r_{4}$. To realize a $1 / 4$ conversion ratio, the second cell is connected between the output port $M I D_{1}$ of the first cell and the converter ground, 0. This is accomplished through the three reconfiguration switches $r_{2}, r_{3}$, and $r_{4}$. The first cell output side (i.e., $V_{o u t}$) switches $s 2_{1}$ and $s 3_{1}$ are disabled and replaced by the reconfiguration switches $r_{2}$ and $r_{3}$, and hence $r_{2}, r_{3}$ are operated through $\Phi_{2}$ and $\Phi_{1}$, respectively. As a result, the first cell output charge is routed to the intermediate node $V_{\text {int }}$ between the two cells instead of the converter output $V_{\text {out }}$. To cascade both cells, the second cell input port $I N_{\text {top }}$ is reconfigured to the intermediate node $V_{\text {int }}$ between the two cells instead of the converter input voltage $V_{i n}$. The switch $s 4_{2}$ is disabled and the reconfiguration switch $r_{4}$ is operated in its place through the same clock phase $\Phi_{2}$. Similarly, to realize the $3 / 4$ conversion ratio, the first cell charge is routed to the intermediate node $V_{\text {int }}$ through the switches $r_{2}$ and $r_{3}$, and the reconfiguration switch $r_{1}$ is operated in place of $s 1_{2}$. With such inter-cell connection, no extra series reconfiguration switches are required.

Figure 2.8: Two 2:1 SC cells interconnection through ratio-reconfiguration switches. $V_{\text {int }}$ is the inter-cell intermediate node.

The proposed inter-cell reconfiguration switches are scalable. By replicating the same four connections between each pair of consecutive cells in an $N$-stage cascade, reconfiguration among the various ratios with a resolution of $m_{o d d} / 2^{N}$ can be realized. The conductance of the right half switches, $r_{1}$ and $r_{4}$, is double the conductance of the left half switches, $r_{2}$ and $r_{3}$, for optimal binary sizing.
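The cascade and stack connections can be modeled with an idealized, unloaded $2: 1$ cell whose output sits midway between its two input ports. Under that assumption, the sketch below enumerates the ratios reachable by an $N$-stage cascade. The function names and the `"cascade"`/`"stack"` labels are hypothetical.

```python
from fractions import Fraction
from itertools import product


def rsc_ratio(connections):
    """Unloaded conversion ratio of a cascade of ideal 2:1 cells.

    connections lists, for cells 2..N, how each cell attaches to the previous
    cell's intermediate node: "cascade" places the cell between V_int and
    ground, "stack" places it between V_in and V_int.
    """
    ratio = Fraction(1, 2)  # first cell sits between V_in and ground
    for mode in connections:
        if mode == "cascade":
            ratio = ratio / 2
        elif mode == "stack":
            ratio = (1 + ratio) / 2
        else:
            raise ValueError(f"unknown connection: {mode}")
    return ratio


def all_ratios(n_bits):
    """All ratios reachable with an N-stage cascade, in ascending order."""
    modes = ("cascade", "stack")
    return sorted(rsc_ratio(c) for c in product(modes, repeat=n_bits - 1))


quarter = rsc_ratio(["cascade"])        # 1/4
three_quarters = rsc_ratio(["stack"])   # 3/4
four_bit = all_ratios(4)                # every m_odd / 16
```

Because each added stage either halves the previous ratio or averages it with unity, the $2^{N-1}$ connection patterns produce exactly the odd numerators over $2^{N}$.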

## Resolution Reconfiguration Switches

Reconfiguration of the recursion depth (i.e., resolution) can be implemented through the same four ratio-reconfiguration switches; no additional programming switches are required. During resolution reconfiguration, the function of the reconfiguration switch pair $r_{2}$ and $r_{3}$ in Fig. 2.8 is changed from routing the cell output charge to $V_{\text {int }}$, to instead extracting charge from the intermediate node. Figure 2.9 illustrates the operation of the ratio-reconfiguration switches to reduce the resolution from 3-bit to 2-bit in a RSC. As shown in Fig. 2.9a, the converter connects three $2: 1$ cells in cascade through the reconfiguration switch blocks $R_{1,2}$ and $R_{2,3}$. The 3-bit converter employs two sub-cells $C 3_{1}$ and $C 3_{2}$ to realize the third cell $C 3$ in the cascade, for maximum resource utilization. The reconfiguration switch pairs $\left(r 1_{3_{1}}, r 1_{3_{2}}\right)$ and $\left(r 4_{3_{1}}, r 4_{3_{2}}\right)$ are operated in parallel, to connect the two sub-cells $C 3_{1}$ and $C 3_{2}$ as one cell in series or stack with the second cell $C 2$. As shown in Fig. 2.9b, to connect the sub-cells $C 3_{1}$ and $C 3_{2}$ in cascade, the inter-cell switches $r 1_{3_{1}}$ and $r 4_{3_{1}}$ are operated in place of the switches $s 2_{3_{1}}$ and $s 3_{3_{1}}$ in order to route the output of cell $C 3_{1}$ to the intermediate node $V_{\text {int } 2}$, while the reconfiguration switch $r 1_{3_{2}}$ or $r 4_{3_{2}}$ is operated in place of the switch $s 1_{3_{2}}$ or $s 4_{3_{2}}$, respectively, to realize $3 / 4$ or $1 / 4$ ratios. A similar procedure is followed for the reconfiguration block $R_{1,2}$ to connect the cells $C 1$ and $C 2$ in cascade. Finally, the second cell $C 2$ output-side switches $s 2_{2}$ and $s 3_{2}$ are operated in place of the reconfiguration switches $r 2_{2}$ and $r 3_{2}$, and a 2-bit resolution is realized as shown in Fig. 2.9b.

### 2.4 Circuit Implementation

In order to validate the performance of the proposed RSC topology, a 4-bit RSC converter that realizes 15 ratios is implemented in a $0.25 \mu \mathrm{~m}$ bulk CMOS process. Importantly, the RSC topology is inherently modular. Thus, design of the converter requires custom implementation of only two SC building blocks.

### 2.4.1 4-Bit Power Stage Block Diagram

Figure 2.10 shows the recursive block diagram of the implemented 4-bit power stage, consisting of the two basic $2: 1$ building blocks: boundary and transfer cells. These two building blocks are connected together to implement four reconfigurable stages: $C 1, C 2, C 3$, and $C 4$. The capacitance and conductance of the last two stages, $C 3$ and $C 4$, are recursively binary-sliced to achieve $100 \%$ capacitance utilization and optimal relative sizing across the various ratios at any resolution. The fourth cell, $C 4$, consists of three binary-sized sub-cells $C 4_{1}, C 4_{2}$, and $C 4_{3}$, while the sub-cell $C 4_{3}$ is further sliced into two sub-cells, $C 4_{3_{1}}$ and $C 4_{3_{2}}$. Similarly, the third cell $C 3$ comprises two binary-weighted sub-cells, $C 3_{1}$ and $C 3_{2}$. The eight total cells are interconnected at four intermediate nodes, $V_{\text {int } 1}, V_{\text {int } 2}, V_{\text {int } 3_{1}}$, and $V_{\text {int } 3_{2}}$, through four reconfiguration blocks, $R_{1,2}$, $R_{2,3}, R_{4_{1}, 4_{2}}$, and $R_{4_{2}, 4_{3}}$, along with a half reconfiguration block $R_{3,4}$.

As shown in Fig. 2.10, two reconfiguration-switch blocks, $R_{1,2}$ and $R_{2,3}$, are employed between the three stages $C 1, C 2$, and $C 3$ to realize recursive interconnection across the various resolutions up to 3-bit operation. Similarly, another two reconfiguration-switch blocks, $R_{4_{1}, 4_{2}}$ and $R_{4_{2}, 4_{3}}$, are used to interconnect the sub-cells of the fourth stage, $C 4$, for 3-bit resolution or lower. Instead of using the typical 4-switch reconfiguration block, a 2-switch reconfiguration block $R_{3,4}$ is used to cascade the third and fourth stages, $C 3$ and $C 4$. The 2-switch reconfiguration block includes only the two switches that deliver charge to an intermediate node, and hence can be considered as a half reconfiguration block. Since the nodes $V_{\text {int } 3_{1}}$ and $V_{\text {int } 3_{2}}$ should be separate when cascading the sub-cells $C 4_{1}, C 4_{2}$, and $C 4_{3}$ to realize the 3-bit resolution, the reconfiguration block $R_{3,4}$ is further sliced into two sub-blocks, $R_{3,4_{1}}$ and $R_{3,4_{2}}$, to enable node isolation as illustrated in Fig. 2.10. The two sub-blocks $R_{3,4_{1}}$ and $R_{3,4_{2}}$ have relative conductances of $\frac{1}{3}: \frac{2}{3}$, respectively, to match the relative sizing between the sub-cells $\left(C 4_{1}, C 4_{2}\right)$ and $\left(C 4_{3}\right)$. Each switch in the implemented five reconfiguration blocks is assigned the optimal binary sizing of the total available conductance $G_{t o t}$, which matches the relative charge that it routes.

### 2.4.2 Reconfiguration Costs

In the implemented 4-bit converter, boundary cells extract charge from the converter input voltage $V_{\text {in }}$ (e.g., $C 1$ and $C 4_{1}$), or deliver charge to the converter output $V_{\text {out }}$ (e.g., $C 4_{31}$ and $C 4_{32}$) across all the ratios. Therefore, these boundary cells only need an extra reconfiguration switch pair to deliver charge to a neighboring cell, or shuttle the charge from a neighboring cell to the converter output $V_{\text {out }}$. On the other hand, transfer cells perform charge displacement from one stage to the next (e.g., $C 2, C 3_{1}, C 3_{2}$, and $C 4_{2}$). Thus, transfer cells employ four reconfiguration switches to extract the charge from one stage and deliver it to the next. Since all the switches are binary weighted to match the relative charge shuttled through a cell, the contribution of the extra four reconfiguration switches in a transfer cell to the flying capacitor bottom-plate parasitics matches the contribution of the original four switches of the $2: 1$ cell. In a boundary cell, such contribution is divided by two in relation to the original switches' contribution.

In total, four cells contribute normalized added drain parasitics of $1 / 2$, while the remaining cells add $100 \%$. The average normalized added drain parasitics from the used reconfiguration switches is less than unity, or approximately $77.6 \%$ of the original switches' drain parasitics. It should be noted that, in general, the drain parasitics constitute a small percentage of the gate capacitance.
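The quoted $77.6 \%$ average can be reproduced if each cell's contribution is weighted by its share of the total capacitance (and hence conductance). That weighting is an assumption, as are the variable names; the cell sizes follow the binary slicing of Section 2.3.2, and the boundary/transfer classification follows the examples listed above.

```python
from fractions import Fraction as F

c3, c4 = F(4, 15), F(8, 15)   # 4-bit binary shares of the total
c4_3 = c4 * F(4, 7)

# Boundary cells add half of the original drain parasitics.
boundary = {
    "C1": F(1, 15), "C4_1": c4 * F(1, 7),
    "C4_31": c4_3 * F(1, 3), "C4_32": c4_3 * F(2, 3),
}
# Transfer cells add as much as the original four switches.
transfer = {
    "C2": F(2, 15), "C3_1": c3 * F(1, 3),
    "C3_2": c3 * F(2, 3), "C4_2": c4 * F(2, 7),
}

# Size-weighted average (the weighting itself is an assumption).
weighted_average = F(1, 2) * sum(boundary.values()) + 1 * sum(transfer.values())
percent = round(float(weighted_average) * 100, 1)
```

With this weighting the boundary cells hold $47 / 105$ of the total and the transfer cells $58 / 105$, giving $163 / 210 \approx 77.6 \%$.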

### 2.4.3 Programmable-Port SC Boundary and Transfer Cells

In Fig. 2.10, each 2:1 SC cell is represented with a single capacitor and four switches. However, in the actual implementation, each cell includes two capacitors and eight switches to implement two out-of-phase $2: 1$ cells. A port state can be defined for a cell $\left(I N_{t o p}, I N_{\text {bottom }}, M I D\right)$. A boundary cell operates in one of the four port-states: $\left(V_{\text {in }}, 0, V_{\text {out }}\right),\left(V_{\text {in }}, 0, V_{\text {INT }}\right),\left(V_{\text {INT }}, 0, V_{\text {out }}\right)$, and $\left(V_{\text {in }}, V_{I N T}, V_{\text {out }}\right)$, where $I N T$ represents an inter-cell node. The first state is the typical case where the cell divides the converter input $V_{\text {in }}$ by two. In the second state, the cell extracts charge from $V_{\text {in }}$ to a neighboring cell. On the other hand, for a boundary cell to deliver charge to the output $V_{\text {out }}$ from a neighbor, the cell input or ground ports are routed from the intermediate node, INT, instead of $V_{\text {in }}$ or 0 , which results in the last two states $\left(V_{I N T}, 0, V_{\text {out }}\right)$, and $\left(V_{\text {in }}, V_{I N T}, V_{\text {out }}\right)$.

Figure 2.11 illustrates the implemented standard boundary cell. Two $180^{\circ}$ phase-shifted 2:1 SC cells are used to guarantee continuous input current through the cell input port, eliminating the need for a bypass capacitance. Since the intermediate node DC level is reconfigured at binary ratios of the input voltage, a transmission gate is used to implement the switches, with the exception of the $V_{i n}$ and ground, 0, switches. The switches $M_{n 1,2}, M_{o 2,4}, M_{o 1,3}$, and $M_{p 1,2}$ are the original switches of the $2: 1 \mathrm{SC}$ converter which implement the typical port-state $\left(V_{i n}, 0, V_{\text {out }}\right)$. A pair of reconfiguration switches can be operated as output-side switches or input-side switches by controlling their driving phases. For instance, by operating the switches $M_{i 1}, M_{i 2}$ in Fig. 2.11 from the non-overlapped clock phases $\Phi_{1}, \Phi_{2}$, respectively, the switches $M_{i 1}, M_{i 2}$ act as output port switches. On the other hand, by driving $M_{i 1}$ from $\Phi_{2}$, and disabling $M_{i 2}$ and $M_{p 1}$, the switch $M_{i 1}$ is operated as an input-side switch and hence the cell input port becomes connected to $V_{\text {int }}$. A similar explanation can be followed to connect the cell ground port $I N_{\text {bottom }}$ to $V_{\text {int }}$ using $M_{i 2}$. Figure 2.12 illustrates the four states of a boundary cell and the implemented cell decoder functional-table.

The transfer cell is designed using the boundary cell as a starting point. At lower-resolution ratios, a transfer cell acts as a boundary cell and hence incorporates the same port-states of the boundary cell. On the other hand, a transfer cell requires two additional states to shuttle charge from one stage to the next. In such cases, the transfer cell input or ground port is connected to the previous cell output port, which is connected to an intermediate node denoted as $V_{\text {int }}$ in Fig. 2.11, while the transfer cell output port is connected to the next stage input/ground port intermediate node $V_{\text {int } 2}$. Thus, two additional port-states, $\left(I N_{\text {top }}, I N_{\text {bottom }}, M I D\right)$, are required for a transfer cell, $\left(V_{\text {int }}, 0, V_{\text {int } 2}\right)$ and $\left(V_{\text {in }}, V_{\text {int }}, V_{\text {int } 2}\right)$, respectively. Figure 2.12 illustrates the additional two states and the selection signals generated from the transfer cell decoder.

### 2.4.4 Output Voltage Regulation

Figure 2.13a shows the overall block diagram of the implemented 4-bit RSC converter chip. Two control loops are implemented in the proposed converter: an inner fine-grain loop and an outer coarse-grain loop. The inner loop, working within a single conversion ratio, should modulate either the switching frequency, $f_{s w}$, or the switched capacitance (i.e., digital capacitance modulation, or DCM [5]) for fine-grain linear output voltage regulation and adaptation under load variations. Frequency modulation is chosen in this work to simplify the implementation complexity, as individual control of split sub-cells is not required in this case. The outer loop, implemented in an all-digital fashion, reconfigures the unloaded conversion ratio to minimize the range over which linear regulation is performed, thereby minimizing efficiency degradation.

## Inner Fine-Grain Controller

The T flip-flop employed in Fig. 2.13a guarantees a $50 \%$ duty-cycle input clock to the non-overlap phase generator. A Strong-Arm comparator running at $f_{\text {comp }}$ is used to provide the clock input to the T flip-flop, as shown in Fig. 2.13a. The comparator sampling clock is produced by an on-chip current-starved oscillator that is set to twice the maximum switching frequency of the power stage; since the power stage switching frequency across all the 15 ratios does not exceed 8 MHz , the current starved oscillator is set to 16 MHz through an external bias, $V_{B}$.

## Outer Coarse-Grain Controller

Coarse-grain control in reconfigurable SC converters typically switches between discrete ratios by using a resistor string to generate ratio threshold levels [32, 22]. However, a large number of ratios requires a prohibitively large resistor string that takes into account $R_{\text {out }}$ variation across the different ratios in order to avoid deadlock. In this work, the power stage itself is used to produce the threshold levels. By operating the SC at the maximum $f_{s w}$ and scanning through the available ratios using binary search, the optimal ratio (i.e., the ratio that provides the required output level $V_{\text {ref }}$ with minimum resistive voltage drop) can be located. The block diagram of the implemented binary search controller is shown in Fig. 2.14. A 4-bit shift register, whose output is supplied to the ratio decoder, is used to hold the current ratio state of the SC power stage as shown in Fig. 2.13a. Once $S T R O B E$ is asserted, $R S T$ is triggered and the power stage is reconfigured into the $1 / 2$ ratio. Then, $E N$ is asserted, initiating the binary search procedure. As a result, the $C L K$ signal is routed directly from the on-chip oscillator, switching the power stage at 8 MHz to provide the minimal output resistance, $R_{\text {out }}$.

The proposed ratio-state code, shown in Fig. 2.14, registers consecutive comparison decisions and enables a recursive implementation of the binary controller. Once the counter overflows ($O V R$ is asserted), the 4-bit shift register stores the present fine-grain controller comparison decision with $V_{r e f}$. If the comparator output, $C O M P$, is zero, the present power stage output is lower than the desired level, $V_{\text {ref }}$, and the SC is reconfigured into a larger binary ratio at the next resolution configuration, $\left(1+R_{i-1}\right) / 2$, once the comparison decision 0 is registered at the $O V R$ edge. On the other hand, when $C O M P$ is 1, the 4-bit register shifts in 1 and the power stage is reconfigured to the lower next-resolution binary ratio $\left(R_{i-1}\right) / 2$, where $i-1$ is the previous search iteration.
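The per-iteration ratio update described above can be traced with a few lines of code. This is only a sketch of the update rule: it does not model the counter, the $S T R O B E$, $R S T$, and $E N$ sequencing, or the comparator itself, and the function names are hypothetical.

```python
from fractions import Fraction


def next_ratio(previous, comp):
    """Ratio after one search iteration.

    comp == 0: output is below V_ref, so move to the larger ratio (1 + R) / 2.
    comp == 1: move to the lower next-resolution ratio R / 2.
    """
    return (previous + (0 if comp else 1)) / 2


def search_trace(decisions):
    """Ratios visited for a sequence of registered comparator decisions."""
    ratio = Fraction(1, 2)  # RST configures the power stage into the 1/2 ratio
    trace = [ratio]
    for comp in decisions:
        ratio = next_ratio(ratio, comp)
        trace.append(ratio)
    return trace


# Hypothetical decision sequence, used only to illustrate the update rule.
trace = search_trace([0, 1, 1])
```

For the decision sequence `[0, 1, 1]` the visited ratios are $1 / 2, 3 / 4, 3 / 8, 3 / 16$: each registered decision adds one bit of resolution, so a 4-bit ratio is reached after three iterations.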

### 2.5 Experimental Verification

The proposed 4-bit Recursive SC converter was fabricated in a $0.25 \mu \mathrm{~m}$ bulk CMOS process using $0.9 \mathrm{fF} / \mu \mathrm{m}^{2}$ MIM capacitors and thin-oxide 2.5 V MOS transistors; a die photo is shown in Fig. 2.13b. The RSC occupies $4.645 \mathrm{~mm}^{2}$ for a total capacitance of 3 nF. A three-ratio $(1 / 3,1 / 2,2 / 3)$ series-parallel (SP) SC converter was fabricated using the same technology to enable normalized performance comparison with the prototyped RSC. The implemented three-ratio SP is optimized for the same current density $0.5 \mathrm{~mA} / \mathrm{mm}^{2}$ as the prototyped 4-bit RSC.

Figure 2.15 shows the measured efficiency of the developed RSC and three-ratio SP converters, along with the results of a numerical model developed for the RSC, three-ratio SP, and 7-bit SAR topologies, with models based on the work in [33]. In addition, an ideal LDO is included for comparison. All converters are shown for a 2.5 V input voltage $V_{\text {in }}$ and a 2 mA constant load current, except for the SP which has a 1.86 mA load current to ensure equal current density. The efficiency of the RSC is measured for the following 14 ratios ($1 / 8,3 / 16,1 / 4,5 / 16,3 / 8,7 / 16,1 / 2,9 / 16,5 / 8,11 / 16,3 / 4,13 / 16,7 / 8,15 / 16$) over an output voltage ranging from 0.1 V to 2.2 V. Interestingly, the efficiency of the RSC at the $9 / 16$ ratio falls below the RSC $1 / 2$ ratio efficiency, since the $9 / 16$ ratio $R_{S S L}$ is $3.5 \times$ larger than the $1 / 2$ ratio. The RSC and SP SC converters both achieve a peak efficiency of $85 \%$, and the numerical models are each within $1 \%$ of measurement results across the output voltage range. The large number of ratios afforded by the RSC topology enables a $38 \%$ expanded output voltage range ($0.1-2.2 \mathrm{~V}$ in contrast to $0.2-1.6 \mathrm{~V}$ for the SP), while achieving $6.4 \%$ and $3.5 \%$ higher efficiency at 0.79 V and 1.2 V output voltages, respectively, compared to the SP converter. The measured RSC also achieves $17.7 \%$ higher efficiency than an ideal LDO at 1.6 V. On the other hand, the SP peak efficiencies at the $1 / 3$ and $2 / 3$ ratios (at 0.68 V and 1.5 V output voltages) exceed the RSC by $5.6 \%$ and $5.3 \%$, respectively. The implemented RSC essentially takes the average of the three-ratio efficiency over the 0.52-to-1.6 V output range, filling the gaps between the three ratios $(1 / 3,1 / 2,2 / 3)$ and maintaining a flatter efficiency profile. The 4-bit RSC achieves greater than $70 \%$ efficiency over the 0.9-to-2.2 V output range with an efficiency improvement of $28 \%$ over the 7-bit SAR.

Figure 2.16 shows the measured and numerically-modeled efficiency given a $940 \Omega$ resistive load for the RSC and a $1 \mathrm{k} \Omega$ load for the SP in order to mimic the operation of a CMOS digital load under DVS conditions. At 0.8 V and 1.2 V output voltages, the three-ratio SC achieves $59 \%$ and $68.7 \%$ efficiencies while the 15-ratio RSC achieves $8 \%$ and $7.6 \%$ higher efficiencies at the same voltages, respectively. The RSC delivers a dynamic voltage operating range from 0.04-to-2.16 V, which is $40.4 \%$ larger than the 3-ratio SC output range from 0.09-to-1.6 V, thereby enabling wider-range DVS operation. The measured operating frequency of the RSC and SP with the external resistive load is shown in Fig. 2.17. The RSC is switched over a $45 \times$ dynamic range, from 200 kHz to 9 MHz, to realize the 0.04-to-2.16 V output voltage range. In contrast, the SP requires a $100 \times$ frequency dynamic range, from 100 kHz to 10 MHz, to produce $V_{\text {out }}$ from 0.09 V to 1.6 V.

Figure 2.18 shows the measured efficiency of the $1 / 2$ RSC conversion ratio versus the load current at an output voltage of 1.15 V . In this case, greater than $80 \%$ efficiency is achieved for load currents ranging from $30 \mu \mathrm{~A}$ to 1 mA . These results illustrate the primary advantage of a frequency modulation control, where the switching frequency, as well as switching parasitics loss, scales with the load current.

The peak efficiency of the RSC and the three-ratio SP for various power/current densities are essentially identical, since both deliver the same $1 / 2$ ratio. In DVS applications, system battery life is a key parameter, and for a digital load of uniform-probability power-states, the system energy efficiency is essentially the weighted average efficiency of the converter over the output voltage range. The weighted-average efficiency is given by $\int P\left(V_{\text {out }}\right) \cdot V_{\text {out }} \eta\left(V_{\text {out }}\right) \mathrm{d} V_{\text {out }}$, where $P\left(V_{\text {out }}\right)$ is the probability of a given power-state and the integration is over the achievable converter range. Figure 2.19 shows the measured and numerically modeled weighted-average efficiencies across the output voltage range, plotted versus current density. As shown in Fig. 2.19a, the measured weighted-average efficiency of the RSC exceeds the SP weighted average by $6.9 \%$ at the same current density of $0.23 \mathrm{~mA} / \mathrm{mm}^{2}$. The modeled efficiency of the RSC maintains higher weighted-average efficiency across different current densities, and approaches a $2.5 \%$ higher average than the SP at $16 \mathrm{~mA} / \mathrm{mm}^{2}$. Note that the modeled and measured results diverge after the nominal current density of $0.5 \mathrm{~mA} / \mathrm{mm}^{2}$, as the model assumes optimal total switch width given the increased current density, while the fabricated chips have fixed total conductance.
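A numerical version of the weighted-average metric might look as follows. The sketch integrates $P \cdot V_{\text {out }} \cdot \eta$ with the trapezoidal rule and divides by $\int P \cdot V_{\text {out }} \mathrm{d} V_{\text {out }}$ so that the result lies between 0 and 1. That normalization is an assumption, since the expression above leaves the scaling of $P\left(V_{\text {out }}\right)$ implicit, and the function names are hypothetical. An ideal LDO with $\eta=V_{\text {out }} / V_{\text {in }}$ serves as the test curve, so no measured data is implied.

```python
def ideal_ldo_efficiency(v_out, v_in):
    """Efficiency of an ideal LDO: eta = V_out / V_in."""
    return v_out / v_in


def weighted_average_efficiency(v_points, eta_points, prob=None):
    """Trapezoidal estimate of (integral of P*V*eta dV) / (integral of P*V dV).

    prob defaults to a uniform probability over the sampled voltage range.
    The normalization by the integral of P*V is an assumption of this sketch.
    """
    if prob is None:
        prob = [1.0] * len(v_points)
    numerator = denominator = 0.0
    for k in range(len(v_points) - 1):
        dv = v_points[k + 1] - v_points[k]
        w0 = prob[k] * v_points[k]
        w1 = prob[k + 1] * v_points[k + 1]
        numerator += 0.5 * dv * (w0 * eta_points[k] + w1 * eta_points[k + 1])
        denominator += 0.5 * dv * (w0 + w1)
    return numerator / denominator


v_in = 2.5
v_grid = [v_in * k / 1000 for k in range(1001)]
eta_ldo = [ideal_ldo_efficiency(v, v_in) for v in v_grid]
ldo_average = weighted_average_efficiency(v_grid, eta_ldo)
```

For the ideal LDO swept uniformly from 0 to $V_{\text {in }}$, the closed-form result is $2 / 3$, which the numerical estimate reproduces closely; a measured efficiency curve would be substituted for `eta_ldo` in practice.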

Since the SP converter can only deliver voltages up to 1.6 V, another weighted-average efficiency metric is calculated assuming that an ideal LDO is used to fill any efficiency gap. With an LDO, the RSC still exceeds the SP measured weighted average by $3.3 \%$ at $0.23 \mathrm{~mA} / \mathrm{mm}^{2}$. At $16 \mathrm{~mA} / \mathrm{mm}^{2}$ and above, the LDO performance dominates the RSC and the SP efficiency and both converge to the same value. As shown in Fig. 2.19b, the RSC maintains superior performance to the SP converter at higher power densities until the LDO performance dominates.

All presented numerically-modeled results employ MIM capacitors with a $1.4 \%$ bottom-plate parasitic capacitance ratio. If MOS capacitors were employed in place of MIM capacitors, the $10 \%$ bottom-plate parasitics in this technology would degrade the efficiency by $12.5 \%$ across the output voltage range for 3 nF of total flying capacitance. On the other hand, if a higher-density MIM capacitance were available, for example with a MIM density of $4 \mathrm{~fF} / \mu \mathrm{m}^{2}$ and a $4 \times$ lower bottom-plate ratio, the efficiency of both the RSC and SP converters would increase at each discrete ratio. However, due to severe linear regulation away from the nominal three ratios in the SP topology, the efficiency between these ratios only marginally improves. On the other hand, the RSC converter has explicit ratios between these gaps, and thus the efficiency of the RSC topology at these voltages is increased. For example, with $4 \mathrm{~fF} / \mu \mathrm{m}^{2}$ MIM capacitors, the weighted-average efficiency of the RSC exceeds the three-ratio SP by $9 \%$ at $0.23 \mathrm{~mA} / \mathrm{mm}^{2}$, or by $6.8 \%$ when including an ideal LDO. In this example, the RSC and SP weighted averages converge at $60 \mathrm{~mA} / \mathrm{mm}^{2}$, which is $3.8 \times$ larger than in the $0.9 \mathrm{~fF} / \mu \mathrm{m}^{2}$ MIM capacitor case. Migrating to a more modern technology node with higher-density MIM [14, 15], MOS [23, 5, 18], ferroelectric [22], or deep-trench capacitors [17, 19, 20] and lower-parasitic switches will thus enable improved performance of the RSC over the SP topology at larger current densities.

Figure 2.20a shows the control response to a variable staircase voltage reference, $V_{\text{ref}}$. The control voltage $V_{\text{ref}}$ is changed every $500 \mu \mathrm{s}$ with variable step sizes of up to 650 mV. Figure 2.20b details the transient coarse-controller response when the strobe signal is activated while the SC is initially producing a 2 V output voltage. Here, the SC power stage phase clock, $clk$, is switched at the maximum frequency while the coarse controller cycles through the various binary ratios until the output reaches the desired level after $8 \mu \mathrm{s}$. In the third cycle of this example, the coarse controller reaches the $13 / 16$ ratio, which cannot produce the desired level $V_{\text{ref}}=2 \mathrm{~V}$, given the converter $R_{\text{out}}$. Thus, a fourth correction cycle automatically results and the Back-Off logic returns the power stage to the correct $7 / 8$ ratio. Finally, the coarse controller hands off the regulation operation to the fine-level frequency controller, where $clk$ goes back to a normal frequency.
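
The reason the $13 / 16$ ratio falls short at $V_{\text{ref}}=2 \mathrm{~V}$ can be sketched with a generic binary search over the 4-bit ratio codes. This is a hypothetical illustration, not the fabricated controller: the function name, the lumped loaded-output drop `v_drop` (standing in for the load current times $R_{\text{out}}$), and its 50 mV value are assumptions, and the order in which the real controller visits the ratios may differ.

```python
from fractions import Fraction


def smallest_sufficient_ratio(v_in, v_ref, v_drop, bits=4):
    """Binary-search the smallest code k such that (k / 2**bits) * v_in - v_drop >= v_ref.

    Returns the ratio as a Fraction and the list of codes tried along the way.
    """
    full = 2 ** bits
    lo, hi = 1, full - 1
    if hi / full * v_in - v_drop < v_ref:
        raise ValueError("v_ref is not reachable with any ratio")
    visited = []
    while lo < hi:
        mid = (lo + hi) // 2
        visited.append(mid)
        if mid / full * v_in - v_drop >= v_ref:
            hi = mid          # this ratio is sufficient; try a smaller one
        else:
            lo = mid + 1      # this ratio is too low; move up
    return Fraction(lo, full), visited


# Assumed numbers: 2.5 V input, 2 V target, 50 mV drop across R_out.
ratio, visited = smallest_sufficient_ratio(2.5, 2.0, 0.05)
print(ratio, visited)
```

With these assumed numbers the unloaded $13 / 16$ output (about 2.03 V) drops below 2 V once the loaded drop is subtracted, so the search settles one code higher, at $7 / 8$.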


Figure 2.9: Realization of 2-bit resolution from 3-bit resolution RSC using the same ratio-reconfiguration switches.


Figure 2.10: Recursive implementation block diagram of the 4-bit RSC converter. The implemented RSC comprises four stages of eight cells $C_{i}$ and five reconfiguration switch blocks $R_{i, i+1}$.


Figure 2.11: Boundary and transfer cells schematic.

| Cell type | 2:1 cell state: $IN_{\text{top}}$ | 2:1 cell state: $IN_{\text{bottom}}$ | 2:1 cell state: MID | $C_{2} C_{1} C_{0}$ | So | $S_{Mn}$ | $S_{Mp}$ | $M_{i1}$: S0 | $M_{i1}$: S1 | $M_{i2}$: S0 | $M_{i2}$: S1 | $M_{i21}$: S0 | $M_{i21}$: S1 | $M_{i22}$: S0 | $M_{i22}$: S1 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Boundary & Transfer Cell | $V_{\text{in}}$ | 0 | $V_{\text{out}}$ | 011 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Boundary & Transfer Cell | $V_{\text{in}}$ | 0 | $V_{\text{int}}$ | 000 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Boundary & Transfer Cell | $V_{\text{int}}$ | 0 | $V_{\text{out}}$ | 010 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Boundary & Transfer Cell | $V_{\text{in}}$ | $V_{\text{int}}$ | $V_{\text{out}}$ | 001 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Transfer Cell | $V_{\text{int}}$ | 0 | $V_{\text{int2}}$ | 110 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| Transfer Cell | $V_{\text{in}}$ | $V_{\text{int}}$ | $V_{\text{int2}}$ | 101 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |

Figure 2.12: Boundary and transfer cells decoder truth table.


Figure 2.13: Recursive switched-capacitor voltage regulator implementation, comprising eight cells of binary weights and two control loops.


Figure 2.14: Recursive binary search controller block diagram.


Figure 2.15: Measured and model-predicted efficiency, at 2 mA fixed load current, of the fabricated 4-bit RSC versus the output voltage at an input voltage of 2.5 V . The measured three-ratio efficiency is at 1.86 mA current and the same input voltage.


Figure 2.16: Measured and model-predicted efficiency with external resistive load, modeling a digital load under DVS operation, of the three-ratio SP and the 4-bit RSC across the output voltage, at an input voltage of 2.5 V .


Figure 2.17: Measured RSC and three-ratio SC switching frequency $f_{sw}$ across $V_{\text{out}}$, using the same external resistive load as in Fig. 2.16.


Figure 2.18: Measured RSC efficiency versus the load current at $1 / 2$ ratio, while supplying 1.15 V output voltage $V_{\text {out }}$.

(b) Model-predicted RSC and SP efficiency across $V_{\text {out }}$, versus different current densities.

Figure 2.19: Measured and predicted weighted-average efficiency versus the load current density, from 0.215 to $215 \mathrm{~mA} / \mathrm{mm}^{2}$, for the fabricated RSC and SP in $0.25 \mu \mathrm{~m}$ bulk CMOS.

(a) Measured stair control voltage $V_{\text {ref }}$ response.

(b) Measured coarse control transient response after strobe activation for $V_{\text{ref}}=2 \mathrm{~V}$.

Figure 2.20: Coarse-controller measured transient response to a stair control voltage $V_{\text{ref}}$. Controller transient response when strobe is activated while $V_{\text{ref}}=2 \mathrm{~V}$, showing the detailed ratio binary search operation.

Table 2.1: Comparison with Previously Published Fully-Integrated SC Converters

| Work | [22] | [18] | [32] | 3-Ratio SP | 4-bit RSC |
| :--- | :---: | :---: | :---: | :---: | :---: |
| Technology | 130 nm | 65 nm | 180 nm | $0.25 \mu \mathrm{~m}$ | $0.25 \mu \mathrm{~m}$ |
| Capacitor Type | Ferroelectric | Bulk PMOS | On-chip | MIM | MIM |
| Chip Area [$\mathrm{mm}^{2}$] | 0.366 | 0.64 | 1.69 | 4.33 | 4.645 |
| Total Capacitance [nF] | 8 | 3.88 | 2.24 | 2.8 | 3 |
| Topology | 1, 2/3, 1/2, 1/3 step-down | 1/3, 2/5 SP | 7-bit SAR | 2/3, 1/2, 1/3 SP | 4-bit RSC |
| $V_{\text{in}}$ [V] | 1.5 | 3-4 | 3.4-4.3 | 2.5 | 2.5 |
| $V_{\text{out}}$ [V] | 0.4-1.1 | 1 | 0.9-1.5 | 0.2-1.6 | 0.1-2.2 |
| Quoted Efficiency ($\eta$) | $93 \%$ | $74 \%$ | $72 \%$ | $85 \%$ | $85 \%$ |
| Load Current @ $\eta$ | 1 mA | 32 mA | $10 \mu \mathrm{~A}$ | 1.86 mA | 2 mA |

### 2.6 Conclusion

A recursive SC converter topology is presented that achieves a flattened efficiency profile over a wide voltage range by employing $2^{N}-1$ ratios in an intelligent and modular manner. Compared to a co-fabricated three-ratio series-parallel converter, the proposed 4-bit RSC achieves both a wider operating range and a higher weighted-average efficiency. To achieve high efficiency with a large number of ratios, the RSC topology maximizes the number of connections to the converter input supply and ground in order to minimize both the charge shuttled through the converter flying capacitors and the cascaded losses. Unlike conventional SC topologies, the RSC SSL loss converges to an upper limit $1 /\left(f_{sw} C_{\text{tot}}\right)$ and becomes fixed for arbitrarily high resolutions $N$. The RSC loss for large resolutions $N$ thus saturates at approximately $2.5 \times$ the loss of a 2:1 SC that utilizes the same available $C_{\text{tot}}$. By employing both recursive inter-cell connection and recursive slicing, all possible resolutions, $N$, and hence their ratios, can be realized without disconnecting a single capacitor and while satisfying optimal relative sizing of the constituent capacitors and switches, thereby ensuring high efficiency even at larger values of $N$. The inherent regularity and modularity of the RSC topology simplifies the implementation of arbitrarily large resolutions with $2^{N}-1$ possible ratios, resulting in opportunities to achieve greater than 15 ratios in future work.

### 2.7 Acknowledgements

This chapter is based on and mostly a reprint of the following publications:

- L.G. Salem and P.P. Mercier, "A recursive switched-capacitor DC-DC converter achieving $2^{N}-1$ ratios with high efficiency over a wide output voltage range," in IEEE Journal of Solid-State Circuits (JSSC), Dec. 2014, vol. 49, no. 12, pp. 2773-2787.
- L.G. Salem and P.P. Mercier, "An 85\%-efficiency fully-integrated 15-ratio recursive switched-capacitor DC-DC converter with 0.1-2.2V output voltage range," 2014 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, Feb. 2014, pp. 88-89.

## Chapter 3

## Flying Domain DC-to-DC Conversion

### 3.1 Introduction

Most modern system-on-chip (SoC) designs utilize multiple power domains to enable per-domain supply scaling commensurate with performance demands. A dedicated DC-DC converter for each power domain is required, which should ideally be integrated on-chip with minimal area overhead (i.e., high power density), feature a conversion ratio of 2:1 or larger to reduce input current and the number of required power pins, and operate efficiently.

Unfortunately, it is conventionally difficult to achieve both high power density and efficiency in standard CMOS, since passive energy storage elements used to process energy, i.e., inductors and capacitors, suffer from low $Q$ or high bottom-plate parasitics, respectively. For example, switched-capacitor (SC) circuits suffer from fundamental slow-switching limit (SSL) $\frac{1}{2} C \Delta V^{2} f$ charge-sharing losses that can be reduced by employing large flying capacitors, $C_{fly}$ in Fig. 3.1a, in an attempt to minimize $\Delta V^{2}$ [21, 27, 34]. Unfortunately, large capacitors not only occupy significant area, limiting power density, but also introduce losses from bottom-plate parasitics, which limit the achievable efficiency (e.g., $<75 \%$ with baseline MOS capacitors) [35, 18]. Exotic capacitor options such as high-density MIM [15], ferroelectric [22], or deep-trench capacitors [20, 36] can yield improved efficiency and/or power density, but still suffer from bottom-plate losses and may not be available in all process technologies.

(a) Conventional integrated power delivery methods.

(b) Proposed flying-domain method.

Figure 3.1: Integrated DC-DC conversion methods.

An alternative means to achieve high power density and efficiency is to stack voltage domains on top of one another, as illustrated in Fig. 3.1a [37, 38, 39]. This achieves implicit DC-DC conversion (at a $2: 1$ ratio in Fig. 3.1a) with minimal area overhead, and ideally $100 \%$ efficiency if the two loads are perfectly balanced. However, it is generally difficult to perfectly balance loads in practical applications, and thus any mismatch current must be supplied by a separate auxiliary converter, as illustrated in Fig. 3.1a, which ultimately reduces power density and efficiency to that of the auxiliary converter when only one domain is on.

To overcome the efficiency-power density trade-off in conventional DC-DC converters, this chapter presents a new class of switching converters called flying-domain (FD) power converters that do not rely on high-quality passive energy storage elements to achieve high efficiency. Since $C_{fly}$ in an SC converter is typically sized to be $>10 \times$ larger than the effective capacitance of the digital load, $C_{L}$, as will be shown later in this chapter, the flying-domain concept, shown in Fig. 3.1b, replaces the flying capacitor with the load itself. This serves to: 1) significantly decrease the converter area by removing $C_{fly}$ altogether, thereby increasing power density, and 2) scale the bottom-plate losses by $\sim C_{L} / C_{fly}$ (i.e., by $>10 \times$), thereby improving efficiency. Furthermore, the output decoupling capacitance, $C_{DC}$, which is nominally used to minimize ripple in SC converters, can remain tied to the DC output node of the FD converter, $V_{O}$, and can be implemented with high-density on- or off-chip capacitors without regard to bottom-plate parasitics. Unlike conventional SC converters, where $C_{fly}$ defines the SSL corner frequency and hence the switching frequency that attains minimum loss, the SSL corner frequency in FD converters is defined by the larger (and less expensive) decoupling capacitor, $C_{DC}$, which sets the voltage ripple across the load yet does not switch or incur bottom-plate losses. This lowers the SSL corner frequency and significantly reduces the charge-sharing losses that plague conventional SC converters. On the other hand, it will be shown that allowing ripple in digital circuits does not adversely affect the overall system-level power consumption (when considering the implicit DC-DC converter losses), and thus $C_{DC}$ can alternatively be reduced in order to further increase power density.

This chapter presents the modeling and theory of FD conversion, and reports measurement results from prototype FD converters. Section 3.2 introduces the techniques employed to realize a high-density, high-efficiency switching topology. Section 3.3 then uses graph theory to prove that the 2:1 FD concept is well-posed and that, with the help of a frequency-scaled switching scheme, it can enable 4:1 conversion with a valid steady state. Section 3.4 describes details of the circuits used to implement a reconfigurable 2:1 and 4:1 FD converter prototype in $0.18 \mu \mathrm{~m}$ SOI, and Section 3.5 presents measurement results. Finally, Section 3.6 concludes the chapter.

### 3.2 Achieving High Power Density and Efficiency via the Flying-Domain Technique

This section introduces models and analysis that show that flying the load instead of a capacitor can be advantageous in modern digital systems. To show this, it is first demonstrated that allowing ripple on digital loads does not adversely affect system-level power consumption, thereby enabling higher converter power density through reduction of decoupling capacitance. Then, an analytical model is derived to show that the flying capacitance in an SC converter is normally set to be much larger than the effective capacitance of the load itself, and thus flying the load instead of the capacitor yields reduced bottom-plate losses and overall improved conversion efficiency and power density. Furthermore, it is shown that adding decoupling can yield even further improved conversion efficiency by reducing the SSL corner frequency without a corresponding increase in bottom-plate parasitics.

### 3.2.1 Supply Ripple Allowance and Overall Circuit Efficiency

Switched DC-DC converters have inherent periodic output voltage variation (i.e., ripple) resulting from the switching of the constituent energy-storage elements while shuttling the charge to the output load. Typically, the clock of synchronous digital circuits, $f_{clk}=1 / T_{clk}$, operates faster than the switching frequency of the DC-DC converter, $f_{SW}=1 / T_{SW}$, as illustrated in Fig. 3.2. The maximum possible clock frequency of a digital load depends on supply voltage, for example in sub-micron CMOS with the following relationship: $f_{clk} \propto\left(v_{dd}(t)-V_{th}\right)^{\alpha} / v_{dd}(t)$ [40, 41]. Since the clock of a digital load is typically not dynamically changed within the period of a DC-DC converter cycle, the maximum clock frequency as determined by the critical path delay is set by the minimum supply voltage within any $T_{SW}$ ($V_{DD}$ in Fig. 3.2). As a result, any periodic supply ripple $v_{dd}(t)$ above the minimum voltage, $V_{DD}$, nominally results in additional power consumption in the load without a return on performance (unless the clock frequency is dynamically adapted to the ripple) [13, 41].

Figure 3.2: Illustration showing excess power consumption from a digital load powered by a rippled supply.

Figure 3.3: Equivalent model of a synchronous digital circuit. The origin of symbiotic capacitance is shown in the inset.

In order to evaluate the power loss caused by such periodic supply variation, a simple model of a CMOS digital circuit is employed as shown in Fig. 3.3. Here, a digital load with activity factor $\alpha$ is represented by a switched-capacitor resistor $\alpha C_{L}$, an intrinsic decoupling capacitor $(1-\alpha) C_{L} / 2$, and an equivalent leakage resistance $R_{\text{leak}}$. At high performance levels, leakage and short-circuit currents can be neglected [40, 42, 43]. An intrinsic (or symbiotic) decoupling capacitance [44, 45] is employed in the model as illustrated in the inset in Fig. 3.3, which originates from idle gates in the vicinity of switching gates.

At maximum power, $\alpha \approx 1$, the digital load switches for $2 \times N$ clock cycles within the SC DC-DC switching period. The rippled supply waveform is thus sampled on the effective load capacitance $C_{L}$ at a $2 \times N$ sampling rate, and the total power consumption of the logic from the rippled supply can be calculated by a backward-Euler discrete integration, as illustrated in Fig. 3.2, and computed as:

$$
\begin{equation*}
P_{L}=2 f_{S W} C_{L} \sum_{i=1}^{N}\left(V_{D D}+\Delta V-\frac{2 \Delta V}{T_{S W}}\left((i-1) T_{c l k}+T_{c l k} / 2\right)\right)^{2} \tag{3.1}
\end{equation*}
$$

where $\Delta V$ is the peak-to-peak supply ripple. A backward-Euler summation is used since the decreasing supply ramp time $T_{S W} / 2$ is much larger than the $R C$ time constant of the nodes within the digital circuit, and hence the loss in the logic transistors' resistance $R C /\left(T_{S W} / 2\right) C \delta V^{2}$ is negligible, and the extra energy drawn from the supply, i.e. the bounded energy triangles in Fig. 3.2, is adiabatically returned to the supply. It can be shown that the total power consumption of a digital circuit in Eqn. (3.1) under a rippled-supply can be simplified to:

$$
\begin{equation*}
P_{L}=C_{L} V_{D D}^{2} f_{c l k}+1 / 2 C_{L} V_{D D} \Delta V f_{c l k} \tag{3.2}
\end{equation*}
$$

The first term in Eqn. (3.2) is the conventional power consumption of the digital load operating from a ripple-free supply of $V_{DD}$. The second term represents the extra AC power loss due to the rippled supply without any gain in circuit performance. This is because, although a higher supply voltage enables operation at a faster clock, the valley of the supply ripple defines the circuit critical path delay and hence the maximum functioning clock speed. Therefore, from the point of view of the digital load circuit, ripple generates excess waste and should be eliminated through inclusion of a large output decoupling capacitance or a complex multi-phase DC-DC converter design.

However, this is only true when considering the load in isolation from the DC-DC converter. In order to evaluate the ripple effect on the DC-DC converter efficiency, consider the case of a 2:1 SC DC-DC converter, illustrated in Fig. 3.4a. First, consider the case where no explicit decoupling, $C_{D}$, or intrinsic decoupling, $(1-\alpha) C_{L} / 2$, capacitance exists. As shown in Fig. 3.4b (left), during $\Phi_{1}$ the odd-numbered switches are turned on, connecting $C_{fly}$ between the input voltage, $V_{IN}$, and the output voltage, $V_{OUT}$, while the equivalent load resistance, $R_{L}$, charges $C_{fly}$ until $V_{OUT}$ reaches $V_{IN} / 2-\Delta V / 2$, as illustrated in Fig. 3.4c. On the other clock phase, $\Phi_{2}$, the even-numbered switches are turned on, connecting $C_{fly}$ in parallel with the output load $R_{L}$, while the charge stored on $C_{fly}$ in the prior phase is released to the output load. In both cases, all the current that passes through the switches' on-resistance, $2 \times R_{\text{on}}$, passes through the load $R_{L}$, and hence the ratio of the SC loss to the output power is $2 R_{\text{on}} / R_{L}$, or essentially $R_{FSL} / R_{L}$, where $R_{FSL}$ is the fast-switching limit (FSL) resistance of the 2:1 SC [21].

Figure 3.4: A sample 2-to-1 SC DC-DC converter. (a) Circuit topology. (b) Phase 1 and phase 2 of the voltage divider. (c) Voltage across $C_{fly}$ and the output voltage waveforms.

Now, consider the case when a decoupling capacitance ($C_{DC}$), comprising symbiotic capacitance, an explicit decoupling capacitance ($C_{D}$), or both, is employed, as illustrated in Fig. 3.4b (right). The presence of decoupling results in extra current through the converter switches ($2 \times R_{\text{on}}$) that is not delivered to the load, $R_{L}$, and is instead sunk to ground. Interestingly, this extra loss due to the decoupling matches the reduction in the digital load AC power consumption due to the reduced supply ripple, resulting in a zero-sum game from a system-level perspective. For example, as $C_{DC}$ is further increased, the output voltage ripple decreases and more AC ripple power is drained to ground and lost as heat through the converter switches instead of being dissipated in the load. When $C_{DC}>10 \times C_{fly}$, $V_{OUT}$ reaches a nearly fixed DC value, $V_{DD}$, with nearly zero ripple, and hence the losses in the SC converter saturate to an upper bound that exactly matches the SSL loss value:

$$
\begin{equation*}
P_{S S L}=C_{f} \Delta V^{2} f_{S W} \tag{3.3}
\end{equation*}
$$

while the $A C$ power loss in the load vanishes. From charge conservation, the charge delivered to the output load during a complete converter switching period, $2 N \times C_{L} V_{D D}$, matches the total charge delivered by the SC flying capacitor, $2 \times C_{f} \Delta V$, in $\Phi_{1}$ and $\Phi_{2}$. Therefore, Eqn. (3.3) can be rewritten as:

$$
\begin{equation*}
P_{S S L}=N \times C_{L} V_{D D} \Delta V f_{S W}, \tag{3.4}
\end{equation*}
$$

which is identical to the AC loss consumed by the digital load in the zero-decoupling-capacitance case, as expressed by the second term in (3.2). Therefore, when the performance of digital circuits is dictated by the minimum supply voltage, so long as the ripple does not exceed device voltage ratings, there is no benefit to the overall system power consumption in reducing the ripple. In fact, reducing the ripple through inclusion of a larger decoupling capacitor or multi-phase interleaving serves only to increase the area or overhead power of the resulting SC converter.
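
The equivalence between Eqn. (3.3), Eqn. (3.4), and the second term of Eqn. (3.2) can be checked numerically. The sketch below is illustrative only: the function names and all component values are assumptions chosen for the check, the flying capacitance is derived from the charge balance $2 N C_{L} V_{DD}=2 C_{f} \Delta V$, and $f_{clk}=2 N f_{SW}$ follows from the $2 \times N$ clock cycles per switching period stated earlier.

```python
import math


def ssl_loss_eq33(c_f, dv, f_sw):
    """SSL loss of Eqn. (3.3): C_f * dV^2 * f_SW."""
    return c_f * dv ** 2 * f_sw


def ssl_loss_eq34(n, c_l, v_dd, dv, f_sw):
    """SSL loss rewritten via charge conservation, Eqn. (3.4)."""
    return n * c_l * v_dd * dv * f_sw


def ac_ripple_loss_eq32(c_l, v_dd, dv, f_clk):
    """Second (AC ripple) term of Eqn. (3.2): (1/2) * C_L * V_DD * dV * f_clk."""
    return 0.5 * c_l * v_dd * dv * f_clk


# Assumed example values (not taken from the measurements).
n = 8            # the load clocks 2*n times per converter switching period
c_l = 100e-12    # effective load capacitance [F]
v_dd = 1.0       # minimum supply voltage [V]
dv = 0.05        # supply ripple [V]
f_sw = 10e6      # converter switching frequency [Hz]

c_f = n * c_l * v_dd / dv   # from 2*N*C_L*V_DD = 2*C_f*dV
f_clk = 2 * n * f_sw        # 2*N clock cycles per switching period

p33 = ssl_loss_eq33(c_f, dv, f_sw)
p34 = ssl_loss_eq34(n, c_l, v_dd, dv, f_sw)
p32_ac = ac_ripple_loss_eq32(c_l, v_dd, dv, f_clk)
print(p33, p34, p32_ac)
```

All three expressions evaluate to the same power for these values, which is the zero-sum result described in the text.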

### 3.2.2 Switching a Capacitor versus Switching the Load

The previous subsection showed that while ripple introduces additional power consumption in a digital load, elimination of ripple through inclusion of a large decoupling capacitor just shifts these losses into SSL losses of an SC converter, resulting in a zero-sum game at the system level. In order to improve system-level power consumption, the SC converter should be operated in the FSL regime (i.e., at a higher switching frequency), which naturally decreases ripple, thereby decreasing digital load power without a corresponding increase in fundamental SC converter losses. The FSL defines the lower bound of the SC equivalent output resistance $R_{\text{out}}$, as approximated by the Euclidean norm of the two limits, $R_{\text{out}}\left(f_{sw}\right)=\sqrt{R_{SSL}^{2}+R_{FSL}^{2}}$. In other words, the lowest possible intrinsic loss in an SC converter is limited by the ESR facing the converter charge flow. Unfortunately, increasing the switching frequency introduces practical loss components from the power-switch gate drive and drain parasitics, and from the top- and bottom-plate parasitics of the flying capacitor. The total loss components in a 2:1 SC are given by:

$$
\begin{align*}
P_{\text {loss }}= & I_{L}^{2} \sqrt{R_{S S L}^{2}+R_{F S L}^{2}}+P_{\text {gate }}+P_{\text {Bot }-c a p} \\
= & I_{L}^{2} \sqrt{\left(\frac{1}{4 C_{f l y} f_{s w}}\right)^{2}+\left(\frac{2 R_{o n}}{W_{s w}}\right)^{2}}  \tag{3.5}\\
& +4 C_{g} W_{s w} V_{D D}^{2} f_{s w}+\beta C_{f l y} V_{D D}^{2} f_{s w}
\end{align*}
$$

where $R_{on}$ and $C_{g}$ are the unit-width switch resistance (in $\Omega \cdot \mathrm{m}$) and gate capacitance (in $\mathrm{F} / \mathrm{m}$), respectively, $W_{sw}$ is the width of a single switch in the 2:1 power stage, and $\beta$ is the ratio of bottom-plate-to-flying capacitance (e.g., $\sim 10 \%$ for MOS capacitors) ${ }^{1}$. It can be shown that the optimal switch width $W_{sw}^{*}$ and switching period $T_{sw}^{*}$ are given by:

$$
\begin{equation*}
W_{s w}^{*}=2 \sqrt[3]{\frac{2 R_{o n}^{2} C_{f l y}}{R_{L}^{2} C_{g}}} \tag{3.6}
\end{equation*}
$$

$$
\begin{equation*}
T_{s w}^{*}=2 R_{L} C_{f l y} \sqrt{\beta+\sqrt{2} \frac{C_{g} W_{s w}^{*}}{C_{f l y}}} . \tag{3.7}
\end{equation*}
$$

With these values, the minimum normalized total loss becomes:

$$
\begin{equation*}
\frac{P_{l o s s}^{*}}{P_{L}}=\sqrt{\beta+2 \sqrt{2} \sqrt[3]{\frac{2 R_{o n}^{2} C_{g}^{2}}{R_{L}^{2} C_{f l y}^{2}}}} \tag{3.8}
\end{equation*}
$$

which agrees to within $1 \%$ with a numerical model that has itself been shown to match measurement results to within $1 \%$ [34]. Given this normalized loss, the achievable efficiency of an SC converter is $\eta=\left(1+P_{\text{loss}}^{*} / P_{L}\right)^{-1}$.
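
To make the trade-off in Eqn. (3.5) concrete, the sketch below evaluates the total loss over a sweep of switching frequencies and picks the minimum by brute force, rather than using the closed-form optimum. All numerical values and function names are assumptions chosen for illustration, not parameters of the fabricated design, and the load power is taken as $P_{L}=V_{DD} I_{L}$.

```python
import math


def sc_loss_eq35(f_sw, i_load, c_fly, r_on, c_g, w_sw, v_dd, beta):
    """Total 2:1 SC loss following Eqn. (3.5)."""
    r_ssl = 1.0 / (4.0 * c_fly * f_sw)                  # slow-switching-limit resistance
    r_fsl = 2.0 * r_on / w_sw                           # fast-switching-limit resistance
    conduction = i_load ** 2 * math.hypot(r_ssl, r_fsl)
    gate = 4.0 * c_g * w_sw * v_dd ** 2 * f_sw          # gate-drive loss
    bottom_plate = beta * c_fly * v_dd ** 2 * f_sw      # bottom-plate loss
    return conduction + gate + bottom_plate


# Assumed example parameters (SI units).
params = dict(
    i_load=1e-3,   # load current [A]
    c_fly=1e-9,    # flying capacitance [F]
    r_on=1e-3,     # unit-width switch resistance [ohm*m]
    c_g=1e-9,      # unit-width gate capacitance [F/m]
    w_sw=1e-3,     # switch width [m]
    v_dd=1.0,      # output voltage [V]
    beta=0.10,     # bottom-plate ratio of a MOS capacitor
)

# Log-spaced sweep from 10 kHz to 100 MHz.
freqs = [10 ** (4 + 4 * k / 400) for k in range(401)]
losses = [sc_loss_eq35(f, **params) for f in freqs]
best = min(range(len(freqs)), key=losses.__getitem__)

f_best = freqs[best]
p_load = params["v_dd"] * params["i_load"]
normalized_loss = losses[best] / p_load
efficiency = 1.0 / (1.0 + normalized_loss)
print(f"f_sw* ~ {f_best:.3g} Hz, P_loss/P_L ~ {normalized_loss:.3f}, eta ~ {efficiency:.3f}")
```

For these assumed values the minimum sits at an interior frequency, where the falling SSL conduction loss meets the rising gate and bottom-plate losses, and the normalized loss stays above $\sqrt{\beta}$.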

In order to establish a relation between the required $C_{f l y}$ for a given target efficiency, Eqn. (3.8) can be manipulated to calculate the normalized flying capacitance for a given minimum normalized loss as:

$$
\begin{equation*}
\frac{C_{f l y}}{C_{L}}=4 \tau f_{c l k} \sqrt{\frac{2 \sqrt{2}}{\left(\left(\frac{P_{l o s s}^{*}}{P_{L}}\right)^{2}-\beta\right)^{3}}} \tag{3.9}
\end{equation*}
$$

where $\tau=R_{on} C_{g}$ can be thought of as the intrinsic delay of a MOSFET. From (3.9), the minimum possible converter loss is set by the capacitance technology $\beta$ (bottom-plate ratio), where $\min \left(P_{\text{loss}}^{*} / P_{L}\right)>\sqrt{\beta}$, which agrees with intuitive analysis in prior work [33, 13].
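
The step from (3.8) to (3.9) can be sketched as follows. The shorthand $X$ is introduced here only for readability, and the substitution $R_{L}=1 /\left(C_{L} f_{clk}\right)$ is an assumption consistent with the first term of (3.2), i.e., a load dissipating $C_{L} V_{DD}^{2} f_{clk}$ at $\alpha \approx 1$.

$$
\begin{aligned}
X & \equiv\left(\frac{P_{loss}^{*}}{P_{L}}\right)^{2}-\beta=2 \sqrt{2} \sqrt[3]{\frac{2 R_{on}^{2} C_{g}^{2}}{R_{L}^{2} C_{fly}^{2}}} \\
X^{3} & =(2 \sqrt{2})^{3} \cdot \frac{2 \tau^{2}}{R_{L}^{2} C_{fly}^{2}}=\frac{32 \sqrt{2}\, \tau^{2}}{R_{L}^{2} C_{fly}^{2}} \\
C_{fly} & =\frac{\tau}{R_{L}} \sqrt{\frac{32 \sqrt{2}}{X^{3}}}=\frac{4 \tau}{R_{L}} \sqrt{\frac{2 \sqrt{2}}{X^{3}}} \\
\frac{C_{fly}}{C_{L}} & =4 \tau f_{clk} \sqrt{\frac{2 \sqrt{2}}{X^{3}}} \quad \text{using } R_{L}=\frac{1}{C_{L} f_{clk}}
\end{aligned}
$$

Since $C_{fly}$ must be finite and positive, $X>0$, which gives the bound $P_{loss}^{*} / P_{L}>\sqrt{\beta}$.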

From (3.9) and values from the ITRS predictive PIDS tables [46, 47], a flying capacitor that is at least $10 \times$ larger than the digital load capacitance is necessary to achieve $>75 \%$ efficiency, and thus interchanging $C_{f l y}$ and $C_{L}$ by switching the load itself instead of the flying capacitor results in $87.7 \%$ achievable efficiency, i.e., a $12.7 \%$ efficiency improvement when using $C_{f l y}$ as a DC capacitor. This exceeds the maximum efficiency expected from typical MIM capacitors, which have less bottom plate losses, though typically through much lower capacitive (and thus power) density.
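
The floor $\min \left(P_{\text{loss}}^{*} / P_{L}\right)>\sqrt{\beta}$, combined with $\eta=\left(1+P_{\text{loss}}^{*} / P_{L}\right)^{-1}$, caps the efficiency at $1 /(1+\sqrt{\beta})$. The sketch below evaluates this ceiling for an assumed $10 \%$ bottom-plate ratio and for the same ratio scaled down by an assumed factor of $10$, as a stand-in for the $\sim C_{L} / C_{fly}$ reduction. These are upper bounds only; they bound, rather than reproduce, the efficiencies quoted in the text, and the function name is hypothetical.

```python
import math


def efficiency_ceiling(beta):
    """Upper bound on efficiency implied by P_loss/P_L > sqrt(beta)."""
    return 1.0 / (1.0 + math.sqrt(beta))


beta_mos = 0.10                  # assumed bottom-plate ratio of a MOS capacitor
beta_scaled = beta_mos / 10.0    # assumed 10x reduction in bottom-plate loss

ceiling_sc = efficiency_ceiling(beta_mos)
ceiling_scaled = efficiency_ceiling(beta_scaled)
print(f"ceiling with beta = {beta_mos:.2f}: {ceiling_sc:.3f}")
print(f"ceiling with beta = {beta_scaled:.2f}: {ceiling_scaled:.3f}")
```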

To arrive at this efficiency improvement number, it can be shown that the equations presented in this section are applicable to FD converters, but with the decoupling capacitor $C_{DC}$ replacing $C_{fly}$, and $\beta$ describing the ratio of the bottom-plate parasitics of the load to the capacitance of the load itself. This critical distinction from SC circuits enables FD converters to have lower bottom-plate losses (when $C_{fly} \gg C_{L}$), as observed from the preceding loss expressions with $C_{fly}$ replaced by $C_{DC}$. In other words, a large and high-density $C_{DC}$ can be employed, which sets the fundamentally achievable efficiency of the circuit, but without regard for its bottom-plate parasitics (since it is not switched). Essentially, while SC converters avoid the challenge of low-$Q$ CMOS inductors, FD converters eliminate the need for high-$Q$ (i.e., with low bottom-plate parasitics) capacitors.

Figure 3.5: SSL-FSL corner frequencies for SC and FD circuits.

Importantly, $C_{DC}$ also replaces $C_{fly}$ in the SSL equations, and thus in FD converters the SSL-FSL corner frequency, $1 /\left(4 R_{FSL} C_{fly}\right)$, is set not by $C_{fly}$, as in SC circuits, but instead by $C_{DC}$. This implies that a large $C_{DC}$, which does not switch, can be employed to set a very low SSL corner frequency. If the converter runs at the same frequency as before (in its prior SC incarnation), SSL losses, which limit the efficiency of conventional SC converters operating at practical frequencies, can be essentially eliminated, all while reducing bottom-plate losses, for a net increase in DC-DC conversion efficiency in the FD technique. Figure 3.5 illustrates the difference in SSL corner frequencies for SC and FD circuits for the conventional case of a 2:1 SC converter employing $C_{DC}=5 \times C_{fly}$, giving a $5 \times$ improvement which results in an FD conversion efficiency of $91 \%$ given the same ITRS MOS capacitor example cited above. Furthermore, switching the load and placing the capacitor in a static state enables a lower-complexity packaging solution when considering discrete SC implementations. For instance, switching large external capacitors for a 2:1 output ratio, as in [48], requires two I/O pads per flying capacitor per load, while switching multiple loads (analogous to multi-phase interleaving, though in this case with multiple loads instead of multiple flying capacitors) around a single external decoupling capacitor costs only a single I/O pad.
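
The $5 \times$ shift in corner frequency follows directly from the expression $1 /\left(4 R_{FSL} C\right)$ once $C_{DC}$ takes the place of $C_{fly}$. A minimal numerical sketch is given below; the resistance and capacitance values and the function name are assumptions chosen only to illustrate the scaling.

```python
def ssl_fsl_corner_hz(r_fsl, capacitance):
    """SSL-FSL corner frequency 1 / (4 * R_FSL * C)."""
    return 1.0 / (4.0 * r_fsl * capacitance)


r_fsl = 2.0          # assumed FSL resistance [ohm]
c_fly = 1e-9         # assumed flying capacitance of the SC converter [F]
c_dc = 5 * c_fly     # decoupling capacitance, C_DC = 5 x C_fly as in the text

f_corner_sc = ssl_fsl_corner_hz(r_fsl, c_fly)   # SC: corner set by C_fly
f_corner_fd = ssl_fsl_corner_hz(r_fsl, c_dc)    # FD: corner set by C_DC
print(f"SC corner {f_corner_sc:.3g} Hz, FD corner {f_corner_fd:.3g} Hz")
```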

In a bulk process, the same advantages of an FD converter can be realized, given the large bottom-plate parasitics of MOS capacitors in such technologies. The load can be placed in a separate deep-nwell and by leaving its bias floating, the deep-nwell bottom-plate parasitics can be reduced by $\sim 2 \times$, as demonstrated in a recent FD power amplifier [49].

### 3.3 State-Space Modeling of Flying-Domain DC-DC Converters

In this section, a network-theory-based analysis is used to find the fundamental loop and cut-set matrices of an FD converter. The analysis presented in this section is based on the theory illustrated in [50, 31, 33]. A switched network can be abstracted by formulating a directed graph (digraph) that represents the topology of the switched circuit while discarding the properties of individual circuit elements, i.e., capacitors and switches. This abstract view will be utilized to derive the component voltages and charge multipliers of an FD converter, and to show that the 2:1 FD topology is a properly-posed switched network with a valid steady-state output. By introducing a frequency-scaled switching technique, it will then be shown that $2^{N}: 1$ conversion ratios are possible.

### 3.3.1 Modeling a 2-to-1 Flying-Domain Converter

A 2:1 FD converter is shown in Fig. 3.6a, and operates by flying the voltage domain of the load itself, $FD_{1}$. The two phases of the circuit, when the odd- and even-numbered switches are enabled respectively, are shown in Fig. 3.6b. To establish the digraph of the 2:1 FD converter, each component is represented by a directed branch, where the direction of each branch indicates the polarity of the component voltage or current (charge). For each phase in the circuit, a tree, $T_{1}$ and $T_{2}$, can be constructed from selected branches of each digraph, respectively. A branch in a digraph is categorized as a twig if it is contained in a tree $T$ and a link otherwise. For each link, corresponding to a selected tree, a unique closed path, i.e., a fundamental loop, can be formed by the set of twigs connecting the two endpoints of that link. For each phase, a fundamental loop matrix $B^{i}$ can be established from the linearly-independent KVL equations associated with all tree links:

$$
\left[\begin{array}{lll}
1 & 1 & -1
\end{array}\right]\left[\begin{array}{c}
v_{F D 1}^{1}  \tag{3.10}\\
V_{O}^{1} \\
V_{I N}
\end{array}\right]=0 \text { and }\left[\begin{array}{ccc}
-1 & 1 & 0
\end{array}\right]\left[\begin{array}{c}
v_{F D 1}^{2} \\
V_{O}^{2} \\
V_{I N}
\end{array}\right]=0
$$

Under no-load condition, the FD converter delivers zero current and therefore each component maintains a fixed voltage across the two phases in steady-state. In such a case, the fundamental loop matrices of each phase hold simultaneously and can be combined as:

$$
\left[\begin{array}{ccc}
1 & 1 & -1  \tag{3.11}\\
-1 & 1 & 0
\end{array}\right]\left[\begin{array}{c}
v_{F D 1} \\
V_{O} \\
V_{I N}
\end{array}\right]=0
$$

Figure 3.6: A 2-to-1 FD DC-DC converter. (a) Circuit topology. (b) Phase 1 and phase 2 of the voltage divider, twigs shown with dark lines.

By inspection, the rank of $B$ in (3.11) is two. Thus, with three unknowns, the matrix $B$ leaves one degree of freedom, and the component voltages in an unloaded 2:1 FD converter can be expressed in terms of the input voltage $V_{I N}$ by partitioning $B$ into a square block $B_{c}$ (the component columns) and the input column $b_{i n}$, as follows:

$$
\left[\begin{array}{ll}
B_{c} & b_{i n}
\end{array}\right]\left[\begin{array}{c}
v_{F D 1}  \tag{3.12}\\
V_{O} \\
V_{I N}
\end{array}\right]=0 \text {, thus }\left[\begin{array}{c}
v_{F D 1} \\
V_{O}
\end{array}\right]=\left[\begin{array}{c}
1 / 2 \\
1 / 2
\end{array}\right] V_{I N}
$$

Hence, a $2: 1$ ratio is realized.

When a load current is drawn from the FD converter, the domain voltage $v_{F D 1}$ and the DC capacitor voltage $V_{O}$ charge and discharge during each phase. To find the charge multipliers of each component, the fundamental cut-sets of each phase digraph (Fig. 3.6b) are examined to produce a system of KCL equations:

$$
\left[\begin{array}{ccc}
1 & -1 & 0  \tag{3.13}\\
0 & 1 & 1
\end{array}\right]\left[\begin{array}{c}
q_{O}^{1} \\
q_{F D 1}^{1} \\
q_{I N}^{1}
\end{array}\right]=0 \text { and }\left[\begin{array}{ccc}
1 & 1 & 0 \\
0 & 0 & 1
\end{array}\right]\left[\begin{array}{c}
q_{O}^{2} \\
q_{F D 1}^{2} \\
q_{I N}^{2}
\end{array}\right]=0
$$

where $q^{i}$ is the charge flow in a component in phase $i$. To produce a combined system of KCL equations for both phases, note that for the output decap $q_{O}^{1}=-q_{O}^{2}$ must hold at steady state, so a single variable $q_{O}=q_{O}^{1}$ suffices. In addition, the total charge delivered to the load $q_{F D 1}$ during the entire clock period, as well as the differential charge $\Delta q_{F D 1}$, relate to the charge delivered in each phase through the relation:

$$
\left[\begin{array}{c}
\Delta q_{F D 1}  \tag{3.14}\\
q_{F D 1}
\end{array}\right]=\left[\begin{array}{cc}
-1 & 1 \\
1 & 1
\end{array}\right]\left[\begin{array}{l}
q_{F D 1}^{1} \\
q_{F D 1}^{2}
\end{array}\right]
$$

By taking the inverse of (3.14), the $F D_{1}$ charge flow during each phase in (3.13) can be replaced by the total $q_{F D 1}$ and differential charge $\Delta q_{F D 1}$, and hence the combined cut-set matrix becomes:

$$
\left[\begin{array}{ccccc}
1 & 1 / 2 & -1 / 2 & 0 & 0  \tag{3.15}\\
0 & -1 / 2 & 1 / 2 & 1 & 0 \\
-1 & 1 / 2 & 1 / 2 & 0 & 0 \\
0 & 0 & 0 & 0 & 1
\end{array}\right]\left[\begin{array}{c}
q_{O} \\
\Delta q_{F D 1} \\
q_{F D 1} \\
q_{I N}^{1} \\
q_{I N}^{2}
\end{array}\right]=0 \text { and }\left[\begin{array}{c}
q_{O} \\
\Delta q_{F D 1} \\
q_{I N}^{1} \\
q_{I N}^{2}
\end{array}\right]=\left[\begin{array}{c}
1 / 2 \\
0 \\
-1 / 2 \\
0
\end{array}\right] q_{F D 1} .
$$

The rank of the combined cut-set matrix $Q$ is four. With five unknowns, the one remaining degree of freedom is used to express the charge flow in each component in terms of the output charge flow $q_{F D 1}$, as shown above.

Note that the component blocks of both the fundamental loop and cut-set matrices of the $2: 1 \mathrm{FD}$ converter (i.e., with the $V_{I N}$ and $q_{F D 1}$ columns moved to the right-hand side) are square and invertible, proving a properly-posed switched topology with a unique solution. Hence, the illustrated 2:1 FD converter is a valid topology that can reach a stable steady state.
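As a hedged numerical check of (3.12) and (3.15), the sketch below moves the $V_{I N}$ and $q_{F D 1}$ columns to the right-hand side and solves the remaining square systems exactly. The helper `solve` and all variable names are illustrative assumptions, not part of the original analysis; voltages are normalized to $V_{I N}$ and charges to $q_{F D 1}$.

```python
from fractions import Fraction


def solve(matrix, rhs):
    """Solve a square linear system exactly by Gauss-Jordan elimination."""
    n = len(matrix)
    # Augmented matrix [A | b] with exact rational entries.
    aug = [[Fraction(x) for x in row] + [Fraction(b)]
           for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Pick a row at or below the diagonal with a non-zero pivot.
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("singular matrix: network is not properly posed")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


half = Fraction(1, 2)

# Loop matrix (3.11): columns [v_FD1, V_O | V_IN]; B_c x = -b_in * V_IN.
B_c = [[1, 1], [-1, 1]]
b_in = [-1, 0]
voltages = solve(B_c, [-b for b in b_in])  # [v_FD1, V_O] per unit V_IN

# Cut-set matrix (3.15): unknowns [q_O, dq_FD1, q_IN^1, q_IN^2];
# the q_FD1 column is moved to the right-hand side.
Q_c = [[1, half, 0, 0],
       [0, -half, 1, 0],
       [-1, half, 0, 0],
       [0, 0, 0, 1]]
q_fd1_column = [-half, half, half, 0]
charges = solve(Q_c, [-q for q in q_fd1_column])  # per unit q_FD1
```

Under these assumptions, `voltages` evaluates to $[1/2, 1/2]$ and `charges` to $[1/2, 0, -1/2, 0]$, in agreement with (3.12) and (3.15).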

### 3.3.2 Modeling a 4-to-1 Flying-Domain Converter

Figure 3.7: A 4-to-1 FD DC-DC converter. (a) Circuit topology. (b) The resulting four switching phases of a properly-posed 4:1 FD converter, twigs shown with dark lines. $V_{I N}$ is a twig in the first three phases only: $\Phi_{1}, \Phi_{2}$, and $\Phi_{3}$.

To implement a $4: 1 \mathrm{FD}$ ratio, the input terminals of a base $2: 1$ cell, $F D_{1}$, are switched through the switches of a second flying-domain converter, $F D_{2}$, as shown in Fig. 3.7a. However, operating all converter switches from a two-phase clock (which would nominally be $\Phi_{1}$ and $\Phi_{4}$ in Fig. 3.7b) results in a non-square fundamental loop matrix, and hence a two-phase 4:1 FD converter is not properly posed, as demonstrated by the $2 \times 4$ combined loop matrix:

$$
\left[\begin{array}{cccc}
1 & -1 & 0 & 0  \tag{3.16}\\
1 & 1 & 1 & 0
\end{array}\right]\left[\begin{array}{c}
v_{F D 1} \\
v_{c 1} \\
V_{O} \\
V_{I N}
\end{array}\right]=0
$$

Fortunately, it is possible to create a valid topology by employing additional states. This can be accomplished in a simple, low-complexity manner by operating $F D_{2}$ at half the frequency of $F D_{1}$, as shown in Fig. 3.7b, based on prior work [51, 52]. The combined four-phase fundamental loop matrix of the proposed multi-phase 4:1 FD converter is given by:

$$
\left[\begin{array}{cccc}
1 & -1 & 0 & 0  \tag{3.17}\\
1 & 1 & -1 & 0 \\
1 & -1 & 0 & 0 \\
1 & 1 & 1 & -1
\end{array}\right]\left[\begin{array}{c}
v_{F D 1} \\
v_{c 1} \\
V_{O} \\
V_{I N}
\end{array}\right]=0
$$

By inspection, the matrix in (3.17) has a redundant row, i.e., the first and third rows are identical. Therefore, the rank of the matrix in (3.17) is three, and hence the matrix $B_{c}$ is square and invertible. This means that a $4: 1 \mathrm{FD}$ converter requires only three switching phases to have a unique solution and produce a properly-posed converter. However, four-phase operation is motivated by the simpler switch-driving circuitry and clock organization. In the four-phase case of Fig. 3.7b, the component voltages in terms of the input voltage can be found from (3.17) as:

$$
\left[\begin{array}{lll}
v_{F D 1} & v_{c 1} & V_{O}
\end{array}\right]^{T}=\left[\begin{array}{lll}
1 / 4 & 1 / 4 & 1 / 2 \tag{3.18}
\end{array}\right]^{T} V_{I N}
$$

Thus, the proposed FD converter provides a $1 / 4$ input-to-output conversion ratio.
Similarly, component charge flow can be found through the fundamental cut-set matrix. From symmetry, it can be noted that the $4: 1$ FD converter boils down to a first-stage 2:1 FD converter that is connected once in parallel to $V_{O}$, during $\Phi_{1}$ and $\Phi_{2}$, and then to $V_{I N}-V_{O}$, during $\Phi_{3}$ and $\Phi_{4}$, with half of the total charge delivered to the load, $F D_{1}$, in each case. Therefore, $q_{C 1}$ and $q_{O}$ are balanced over the clock period of the first flying-domain, $C L K[0]$, i.e., $q_{C 1}^{1}=-q_{C 1}^{2}$ and $q_{C 1}^{3}=-q_{C 1}^{4}$, and similarly for $q_{O}$. Each component charge flow can be found as:

$$
\left[\begin{array}{lllllll}
q_{C 1} & q_{O} & \Delta q_{F D 1} & q_{I N}^{1} & q_{I N}^{2} & q_{I N}^{3} & q_{I N}^{4}
\end{array}\right]^{T}=\left[\begin{array}{lllllll}
-1 / 4 & 1 / 4 & 0 & 0 & 0 & 0 & -1 / 4 \tag{3.19}
\end{array}\right]^{T} q_{F D 1} .
$$
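The voltage results of this subsection can be cross-checked numerically. The sketch below row-reduces the loop matrices of (3.16) and (3.17) with exact arithmetic, reports their ranks, and reads the component voltages off the reduced form. The helper `rref` and the variable names are illustrative assumptions; columns follow the order $\left[v_{F D 1}, v_{c 1}, V_{O}, V_{I N}\right]$ used in the text.

```python
from fractions import Fraction


def rref(matrix):
    """Return (reduced row echelon form, pivot columns) using exact arithmetic."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    pivots = []
    r = 0  # index of the next pivot row
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column: it is a free variable
        rows[r], rows[pivot] = rows[pivot], rows[r]
        p = rows[r][col]
        rows[r] = [x / p for x in rows[r]]
        for i in range(len(rows)):
            if i != r and rows[i][col] != 0:
                f = rows[i][col]
                rows[i] = [x - f * y for x, y in zip(rows[i], rows[r])]
        pivots.append(col)
        r += 1
        if r == len(rows):
            break
    return rows, pivots


# Two-phase loop matrix (3.16) and four-phase loop matrix (3.17).
B_two_phase = [[1, -1, 0, 0],
               [1, 1, 1, 0]]
B_four_phase = [[1, -1, 0, 0],
                [1, 1, -1, 0],
                [1, -1, 0, 0],
                [1, 1, 1, -1]]

_, pivots_two = rref(B_two_phase)
reduced, pivots_four = rref(B_four_phase)
rank_two_phase = len(pivots_two)
rank_four_phase = len(pivots_four)

# With pivots in the three component columns, V_IN is the only free variable,
# so each component voltage equals minus the V_IN entry of its pivot row.
ratios = [-reduced[k][3] for k in range(3)]  # [v_FD1, v_c1, V_O] per unit V_IN
```

In this sketch the two-phase matrix has rank two, which cannot fix three component voltages, while the four-phase matrix has rank three and yields $[1/4, 1/4, 1/2]$, consistent with (3.18).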

Figure 3.8: Schematic of the unit 2:1 FD power stage cell.

Similarly, other binary ratios $2^{N}: 1$ can be realized by recursively flying the input terminals of a cell $F D[i]$ through the switches of a subsequent flying cell $F D[i+1]$, while operating the recursive array of $N$ cells at binary-decaying frequencies $f_{0} / 2^{i}$, where $f_{0}$ is the switching frequency of the first cell.
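The frequency scaling described above can be sketched as follows. This is a minimal illustration that assumes the cell index runs from $i=0$ (the first cell, at $f_{0}$) to $N-1$; the function names are hypothetical.

```python
from fractions import Fraction


def fd_cell_frequencies(f0_hz, n_cells):
    """Switching frequency of each cell FD[i], i = 0..N-1, as f0 / 2**i."""
    return [f0_hz / 2**i for i in range(n_cells)]


def ideal_conversion_ratio(n_cells):
    """Ideal unloaded output-to-input ratio of N recursively flown 2:1 cells."""
    return Fraction(1, 2**n_cells)


# Example: an 8:1 configuration whose first cell switches at 1 MHz.
freqs = fd_cell_frequencies(1.0e6, 3)
ratio = ideal_conversion_ratio(3)
```

For three cells this gives clock frequencies of 1 MHz, 500 kHz, and 250 kHz and an ideal ratio of $1/8$.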

### 3.4 Circuit Implementation

This section discusses circuit implementation details of the reconfigurable FD power stage, the hysteretic control scheme, and flying level shifters [53].

### 3.4.1 Reconfigurable Power Stage Design

Figure 3.9: Reconfiguring the implemented FD converter between the $2: 1$ and $4: 1$ modes.

The proposed FD converter is designed in a modular fashion using a unit $2: 1$ cell that can be cascaded to realize additional binary ratios. The transistor-level schematic of this unit cell is shown in Fig. 3.8, and consists primarily of four power switches, M1-M4, implemented using thin-oxide transistors (1.5 V max in the employed $0.18 \mu \mathrm{~m}$ SOI process). The power switches are driven by pre-drivers that operate with the same voltage range as the output $\left(V_{0} \sim\left(V_{H}[i]-V_{L}[i]\right)<1.5 \mathrm{~V}\right)$. The M3 and M4 pre-drivers thus operate between $\left\{V_{0}, \mathrm{GND}\right\}$, while the M1 and M2 pre-drivers are level shifted via a 70 fF MOS capacitor $\left(C_{C}\right)$ and a static latch that can operate over a wide $V_{B A T}$ range, for driving the switches between $\left\{V_{B A T}, V_{0}\right\}$. A static latch, rather than a cross-coupled PMOS structure, is selected in the level shifter to avoid output drift caused by PMOS transistor leakage at low frequencies. To provide clock signals over a wide frequency range ($\sim 100 \mathrm{~Hz}$ to 300 MHz) while operating down to $V_{B A T}=0.4 \mathrm{~V}$, an inverter of 4-stacked min-sized $(1 / 2)$ transistors is used in the level shifter to establish weak feedback in the latch. Such weak feedback allows the trigger $C L K \_I N$ inverter to change the latch state through a small coupling capacitor $C_{C}$, while maintaining static retention against any leakage. Non-overlapping clock phases, $\Phi_{1}$ and $\Phi_{2}$, are then generated by 3-transistor-based inverters with feedback from the power switches themselves to ensure minimal dead-time across all corners, given the wide input voltage range, and thereby eliminate any shoot-through current. According to simulations, the dead-time circuit achieves 64.8 ps and 110 ps non-overlap at the $F F$ and $S S$ corners, respectively. A fork-based clock tree is used to distribute balanced $C L K$ and $C L K_{H}$ signals across the power stage in order to further reduce shoot-through currents.

Figure 3.10: Plot of switch width (left axis, green dotted curves) and efficiency (right axis, red solid curves) with and without automatic conductance tracking.

The power stage can be recursively modularized [34] to implement a reconfigurable 2:1 and $4: 1$ converter as illustrated in Fig. 3.7a. In the $4: 1$ mode, $\mathrm{FD}_{2}$ is operated at half the frequency of $\mathrm{FD}_{1}$ by dividing the input clock frequency using a D flip-flop. In the $2: 1$ mode, switches M1 and M4 (S5 and S8 in Fig. 3.7a) are permanently enabled in $\mathrm{FD}_{2}$ via selection logic, as illustrated in Fig. 3.9. Switches S1 and S4 in $\mathrm{FD}_{1}$ (Fig. 3.7a) can also be permanently enabled to generate the 1:1 ratio.

### 3.4.2 Switch-Load Voltage Matching for Optimal Conductance Tracking

When the input voltage changes or the FD converter is reconfigured between the $1 / 4$ and $1 / 2$ ratios, the optimal total switch conductance (i.e., width) could change, which would normally require enabling or disabling a portion of the power switches in order to realize the optimal design point, as in similar SC implementations [35, 54], thereby limiting efficiency in ultra-low-power applications. Fortunately, the level-shifting pre-driver structures described in the preceding section ensure that the power switches are automatically driven by a gate voltage that is matched to the respective cell load voltage, along with the pre-driver logic circuit (for matched overhead), when the input voltage changes. By matching the drive voltage $V_{s w}$ of each cell's switches to the flying load voltage $V_{F D i}$, the switch on-resistance $R_{o n}$ follows the same scaling trend as the load resistance $R_{L}\left(1 / C_{L} f_{c l k}\right)$ under DVFS, where $R_{o n}$ and $f_{\text {clk }}$ are proportional to $1 / V_{F D i}$ and $V_{F D i}$, respectively, and hence the optimal switch width $W_{s w}^{*}$ from (3.6) is almost fixed as $V_{B A T}$ or the ratio changes. The inherent switch-load voltage match enables direct cascading of the multi-level FD unit cells without complicated and power-consuming gate-drive-voltage conditioning circuitry, achieving a nearly optimal design point across all ratios and input voltages, as illustrated in Fig. 3.10, where a $\sim 5 \%$ efficiency improvement is observed.
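As a first-order illustration of this argument (a sketch under stated assumptions, not a derivation taken from the measurements), let $k_{R}$ and $k_{f}$ be hypothetical proportionality constants, and assume that $R_{o n}$ scales inversely with the switch width $W_{s w}$:

$$
R_{o n}=\frac{k_{R}}{W_{s w} V_{F D i}}, \quad f_{c l k}=k_{f} V_{F D i} \quad \Rightarrow \quad \frac{R_{o n}}{R_{L}}=R_{o n} C_{L} f_{c l k}=\frac{k_{R} k_{f} C_{L}}{W_{s w}}
$$

Under these assumptions the ratio $R_{o n} / R_{L}$ does not depend on $V_{F D i}$, so a width chosen at one operating point keeps the same switch-to-load resistance ratio as $V_{B A T}$ or the conversion ratio changes.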


Figure 3.11: Transient waveforms of the hysteretic controller and block diagram of the reconfigurable 4:1/2:1 FD converter powering an ARM Cortex M0.

### 3.4.3 Hysteretic Control

Figure 3.12: Schematic and example timing diagram of the proposed sampling level shifters.

The intermediate node of the FD converter, $V_{0}$, shown in Fig. 3.8 for the $2: 1$ converter and in Fig. 3.7a for the $4: 1$ converter, is a scaled value of the AC voltage across the flying domain, yet is referenced to GND. This presents an ideal fixed-domain position to sense the flying-domain voltage change in order to control the FD converter (i.e., without having to fly comparators). As illustrated in Fig. 3.11, when the load transitions between the up and down positions, the flying load voltage, $V_{L, f l y}$, jumps to $\left(V_{B A T}+\Delta V\right) / 2$ and linearly decays to $\left(V_{B A T}-\Delta V\right) / 2$ over a half cycle. Similarly, when the load is connected to the up (down) position, any capacitance at node $V_{0}$ will linearly charge (discharge) from $\left(V_{B A T} \pm \Delta V\right) / 2$. Thus, the ripples across $V_{L, f l y}$ and $V_{0}$ have the same magnitude (in this case $\Delta V$), but are shaped as sawtooth and triangular waves, respectively. The FD converter, in either the $2: 1$ or the $4: 1$ mode, can then be controlled via a hysteretic scheme by sensing $V_{0}$ and comparing it to voltage bounds $\Delta V / 2$ above and below $V_{B A T} / 2$, as shown in Fig. 3.11. The proposed hysteretic approach scales the switching frequency, and thus the switching parasitic losses, with load current, thereby enabling high efficiency across a wide dynamic current range. Due to time limitations at design time, the hysteretic comparators were not included on-chip, and are thus implemented off-chip using Analog Devices ADCMP600 for testing purposes, though it is expected that the area and power overhead of integrated comparators will be minimal, as in conventional integrated voltage regulators. In a fully-integrated control loop, an ARM-based comparator architecture that relies on a regenerative feedback latch [55] and consumes tens of $\mu \mathrm{W}$ in $0.18 \mu \mathrm{~m}$ SOI will be employed, as in prior fully-integrated SC converters [56].
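The way the hysteretic scheme ties switching frequency to load current can be sketched with a simple event-driven model. The model below is an idealization and not the implemented controller: it assumes that $V_{0}$ ramps linearly at $\pm I_{load} / C$ between the two bounds, ignores comparator delay and non-overlap time, and uses illustrative values (a 10 nF node capacitance and a 50 mV window). All function names are hypothetical.

```python
def hysteretic_toggle_times(v_bat, delta_v, c_node, i_load, n_toggles):
    """Times at which the load position toggles in an idealized hysteretic loop.

    V0 is assumed to ramp linearly at +/- i_load / c_node and the load is
    flipped whenever V0 reaches VBAT/2 + delta_v/2 or VBAT/2 - delta_v/2.
    """
    upper = v_bat / 2 + delta_v / 2
    lower = v_bat / 2 - delta_v / 2
    v0 = v_bat / 2    # start in the middle of the hysteresis window
    load_up = True    # load in the up position: V0 charges
    t = 0.0
    times = []
    for _ in range(n_toggles):
        target = upper if load_up else lower
        t += c_node * abs(target - v0) / i_load  # time for a linear ramp
        v0 = target
        load_up = not load_up
        times.append(t)
    return times


def switching_frequency(v_bat, delta_v, c_node, i_load):
    """Steady-state toggle frequency: one full up-ramp plus one full down-ramp."""
    times = hysteretic_toggle_times(v_bat, delta_v, c_node, i_load, n_toggles=4)
    return 1.0 / (times[3] - times[1])


# Illustrative operating points (assumed values, not measured data).
f_light = switching_frequency(v_bat=1.5, delta_v=0.05, c_node=10e-9, i_load=21.8e-6)
f_heavy = switching_frequency(v_bat=1.5, delta_v=0.05, c_node=10e-9, i_load=1.0e-3)
```

In this idealized model the frequency equals $I_{load} /(2 C \Delta V)$, so it scales linearly with load current, which is the property the hysteretic approach relies on to reduce switching losses at light load.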

### 3.4.4 Flying-Domain Interface Shifters

Since the digital load in an FD converter is flying with respect to the battery terminals, all I/O must be level-shifted from the normally fixed domain, $\left\{V_{0}, \mathrm{GND}\right\}$ into the periodically changing voltage-state of the load, $\left\{V_{B A T}, V_{0}\right\}$ or $\left\{V_{0}, \mathrm{GND}\right\}$. This is accomplished using area-efficient sampling level shifters for low-frequency signals, and faster continuous-time level shifters for high-speed signals.

Figure 3.13: Schematic of the proposed continuous-time level shifters.

The proposed sampling fixed-to-flying and flying-to-fixed level shifters are shown in Fig. 3.12, and consist of an inverter powered between $\left\{V_{0}, L\right\}$, where $L$ is the dynamic low rail of the flying domain (master stage), cascaded with a latch (slave stage) operating between the FD rails. When the FD is down (operating between $\left\{V_{0}, \mathrm{GND}\right\}$), the master stage is transparent and the $D$ input is passed to the slave stage output $Q$. When the FD is up (operating between $\left\{V_{B A T}, V_{0}\right\}$), the slave stage holds the data via static latch feedback. The sampling fixed-to-flying level shifter therefore samples the input waveform on the rising edge of the sampling clock, so the converter clock should run at least at twice the frequency of the I/O signals, i.e., the Nyquist rate, as illustrated in Fig. 3.12. A similar operation follows for the sampling flying-to-fixed shifter, where the slave stage's rails are instead connected to the fixed domain. In either case, a static clamp transistor, as shown in Fig. 3.12, is added to the master stage to enable a low-impedance master stage output node and to prevent possible overstress of the master transistors at high-voltage operation during the latching phase.

Figure 3.14: Die photograph of the test chip.

For I/O operating at a higher frequency than the converter clock, a continuous-time level shifter, depicted in Fig. 3.13, can be employed. In this case, the I/O is level shifted from $\left\{V_{0}, \mathrm{GND}\right\}$ to $\left\{V_{B A T}, V_{0}\right\}$ using a conventional 70 fF capacitive level shifter (and weak-feedback static latch), and a combiner circuit, implemented using a CMOS AND gate, dynamically chooses either the up or down domain signals, $A_{H}$ or $\bar{A}$, respectively, depending on the state of the FD load. The continuous shifter occupies $1.3 \times$ more area and consumes $1.6 \times$ more power than the sampling shifter, and is thus only used for high-speed signals when necessary.

In a typical digital processor, input conditioning and voltage level translation are required from the external high-voltage bus. The back-to-back latch (slave stage) of the proposed shifter can be integrated within the high-voltage level translation buffering to minimize the overhead of the proposed shifters.

### 3.5 Experimental Verification

In order to experimentally verify the benefits of the proposed flying-domain converter, a prototype converter was implemented and validated in a $0.18 \mu \mathrm{~m}$ SOI process. Several different versions of the converter were implemented in order to illustrate functionality and quantify performance for different load conditions: a resistive load, a cascaded inverter/ring oscillator load, and an ARM Cortex M0 processor load. In all cases, the FD converter and all its circuitry were powered via a $0.4-3 \mathrm{~V}$ input battery voltage, and all measurements were taken via 4-wire connections. A die photo is shown in Fig. 3.14. It should be noted that while the presented design is implemented in an SOI process, which helps to reduce bottom-plate parasitics, FD converters can be implemented in bulk processes if triple-well options are available.

Figure 3.15: Measured efficiency of the $2: 1 \mathrm{FD}$ converter when powering an on-chip inverter string.

Figure 3.16: Measured input and output of the cascaded inverter chain when powered by the FD converter.

Figure 3.15 shows the conversion efficiency results when powering an on-chip resistive load. The load was implemented using a p-type unsilicided polysilicon over shallow trench isolation resistor, with a sheet resistance of $320 \Omega / \square$ and a minimum width of $10 \mu \mathrm{~m}$. The implemented resistor can be probed externally through two I/O pads, and was measured using a precision multimeter to be $645.6 \Omega$. Careful layout was followed so that the on-chip resistor suffers from minimal top/bottom capacitance due to the I/O pads, which in fact act as symbiotic capacitance to retain state during brief non-overlap periods. The FD converter occupied $295 \mu \mathrm{~m}^{2}$, including switches, pre-drivers, non-overlap circuitry, and capacitive level shifters. When powering the $645.6 \Omega$ resistor at 1.5 V, the converter achieved an efficiency of $91.7 \%$ at a power density of $11.8 \mathrm{~W} / \mathrm{mm}^{2}$ ($2.3 \mathrm{~W} / \mathrm{mm}^{2}$ when including the 13.1 pF on-chip decoupling implemented using $10.7 \mathrm{fF} / \mu \mathrm{m}^{2}$ MIM plus MOS stacked capacitance). This measured efficiency matches the results predicted from (3.8) with $<1 \%$ error when replacing $C_{F L Y}$ in (3.8) with $C_{D C}$ for the FD converter.
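The power-density figures quoted for the resistive-load measurement can be reproduced from the stated numbers with a few lines of arithmetic. The sketch below assumes that 1.5 V is the voltage across the $645.6 \Omega$ resistor and that the decoupling area is simply the capacitance divided by the quoted density; the variable names are illustrative.

```python
# Values quoted in the text.
V_LOAD = 1.5                 # V, voltage across the resistive load (assumed)
R_LOAD = 645.6               # ohm, measured on-chip resistor
AREA_CONVERTER_UM2 = 295.0   # um^2, switches, pre-drivers, level shifters
C_DECAP = 13.1e-12           # F, on-chip decoupling
CAP_DENSITY = 10.7e-15       # F per um^2, MIM plus MOS stack

UM2_PER_MM2 = 1.0e6

p_load = V_LOAD**2 / R_LOAD                  # W delivered to the resistor
area_decap_um2 = C_DECAP / CAP_DENSITY       # um^2 occupied by the decap

# Power density of the power stage alone, in W/mm^2.
density_stage = p_load / (AREA_CONVERTER_UM2 / UM2_PER_MM2)

# Power density including the on-chip decoupling, in W/mm^2.
density_total = p_load / ((AREA_CONVERTER_UM2 + area_decap_um2) / UM2_PER_MM2)
```

With these inputs the load power is about 3.5 mW, and the two densities round to $11.8 \mathrm{~W} / \mathrm{mm}^{2}$ and $2.3 \mathrm{~W} / \mathrm{mm}^{2}$, matching the reported values.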


Figure 3.17: Measured frequency of the on-chip ring oscillator when powered from a conventional supply and from the FD converter (a). Measured frequency offset between the two configurations (b).

Figure 3.18: Measured energy of the ring oscillator load when powered from a conventional supply (solid line) and from the FD converter (red circles).

Figure 3.19: Measured efficiency of the $2: 1 \mathrm{FD}$ converter when powering a ring oscillator load.

To demonstrate that the proposed FD converter can power conventional digital loads, another FD converter was designed to power a cascaded chain of inverters (of total equivalent load capacitance $C_{L} \sim 31 \mathrm{pF}$) that can be driven directly from off-chip, or configured as a ring oscillator. To first validate that flying the inverters did not affect performance, and to further validate the performance of the fixed-to-flying and flying-to-fixed level shifters, the cascaded inverter load was first driven by an off-chip signal generator operating in the fixed domain $\left(V_{I N, f i x e d}\right)$. This signal passes through a fixed-to-flying level converter, through the flying-domain inverter chain, and then through a flying-to-fixed level converter ($V_{\text {OUT,flying-to-fixed }}$). Figure 3.16 summarizes this measurement result, showing that the input and output data match after a small delay and an inversion. Unlike conventional SC circuits, whose efficiencies stay relatively constant at smaller currents due to the increasing SSL losses that trade off with reduced switching frequency, the efficiency of the proposed FD converter actually increases with lower load current, as shown in Fig. 3.15. This increase occurs because SSL losses are absent in the employed operating range: the 10 nF off-chip decoupling capacitance (commonly used in DC-DC converter design and as decoupling for many digital loads [17, 22, 57, 58, 59]) lowers the SSL-FSL corner frequency. Since SSL losses are minimal in this operating regime, the FD converter can be modeled by the converter's FSL resistance (i.e., $2 R_{o n}$) driving an effective load resistance, $R_{L}$, whose magnitude increases as the current is reduced. The FD converter achieves a peak efficiency of $99.2 \%$ at a current of 0.37 mA, and is within $0.4 \%$ of the simple resistive divider circuit model depicted in Fig. 3.15 for currents ranging from 0.37 to $3.7 \mathrm{~mA}\left(V_{B A T}=3 \mathrm{~V}\right)$.
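The resistive divider model mentioned above can be sketched as an ideal 2:1 stage followed by a series resistance of $2 R_{o n}$ feeding $R_{L}$. The on-resistance used below is a hypothetical value picked only so that the sketch lands near the reported peak; it is not a measured device parameter. The model also assumes a load voltage of roughly 1.5 V at $V_{B A T}=3 \mathrm{~V}$ and ignores gate-drive and parasitic losses.

```python
def divider_efficiency(r_load_ohm, r_out_ohm):
    """Efficiency of an ideal converter with series output resistance r_out."""
    return r_load_ohm / (r_load_ohm + r_out_ohm)


R_ON_OHM = 16.0   # hypothetical switch on-resistance (assumption)
V_LOAD = 1.5      # V, approximate load voltage for VBAT = 3 V (droop ignored)

r_out = 2 * R_ON_OHM                     # FSL output resistance, 2 * R_on
currents_a = [0.37e-3, 1.0e-3, 3.7e-3]   # load currents within the quoted range

# Effective load resistance grows as the current falls, so efficiency rises.
efficiencies = [divider_efficiency(V_LOAD / i, r_out) for i in currents_a]
```

Because $R_{L}$ grows as the current falls while $2 R_{o n}$ stays fixed, the modeled efficiency rises monotonically toward light load, which is the trend described for Fig. 3.15.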

Figure 3.20: Measured debugging I/O waveforms of the M0 processor after level shifters.

Next, the inverters were configured as a ring and powered under two different scenarios: via the FD converter, and via a regular power supply set to exactly half the input voltage of the FD converter. As shown in Fig. 3.17, the output frequency of the ring oscillator matched between the two cases to within $2 \%$ for output voltages ranging from 0.4 to 1.5 V with less than $50 \mathrm{mV}_{p p}$ ripple. To validate the efficiency of power conversion, the energy per cycle of the ring when powered directly from a regular supply was compared to the total energy per cycle consumed at the input of the FD converter. As summarized in Figs. 3.18 and 3.19, the FD converter, thanks in part to the automatic optimal conductance tracking, achieved an efficiency greater than $96 \%$ from 0.4 to 1.5 V, with reduced efficiency below 0.4 V due to power switch leakage. The FD converter in this implementation also contained switches to enable the 4:1 mode (with $C_{D C, 2: 1}=810 \mathrm{pF}$), and at an input voltage of 1.6 V and output voltage of 0.4 V, the $4: 1 \mathrm{FD}$ converter achieved an efficiency of $90.5 \%$ when powering the ring oscillator with $0.014 \mathrm{~W} / \mathrm{mm}^{2}$ peak power density. In this implementation, the $4: 1$ power density was limited by the low-density capacitance used for the inter-stage flying decoupling; higher power density can be achieved with denser capacitance, as in conventional SC converters.

Figure 3.21: Measured controller waveforms (a) and simulated voltage across flying load $V_{L, f l y}$ (b) under a current step from $21.8 \mu \mathrm{~A}$ to 1 mA.

To validate that the FD converter concept can power larger, more conventional digital loads, another FD converter was implemented to power a co-fabricated ARM Cortex M0 processor. The processor occupied $1.95 \mathrm{~mm}^{2}$, and the entire processor was switched around an off-chip 10 nF decoupling capacitor. Typical values of off-chip decoupling for commercially available ARM Cortex M0 processors range from 100 nF to $10 \mu \mathrm{~F}$ [58, 59]. All regular I/O were buffered via sampled level shifters, with the exception of the digital clock, which employed a continuous level shifter for higher-speed operation. The processor consumed 3.63 mA at 1.5 V when running a checksum program at 1 MHz, as measured by a conventional power supply. Figure 3.20 shows measurement results of several key debugging I/Os after level shifting when flying the entire processor as the load within the FD converter, indicating successful operation. The FD converter in this case achieved an efficiency of $90.8 \%$, limited by the equivalent bottom-plate parasitics of the load itself.

The hysteretic controller, enabled by the fact that the internal $V_{O}$ node represents the output load voltage, is validated through load step measurements in Fig. 3.21a. When stepping the load current from $21.8 \mu \mathrm{~A}$ to 1 mA using a step frequency input to the on-chip cascaded buffer $\left(V_{B A T}=1.5 \mathrm{~V}\right)$, the employed hysteretic controller responds within $\sim 330 \mathrm{~ns}$, and achieves a ripple lower than $50 \mathrm{mV}_{p p}$ at $V_{O}$. At steady state, with the controller hysteresis set to 50 mV, the controller maintains $\sim 50 \mathrm{mV}_{p p}$ and $\sim 62 \mathrm{mV}_{p p}$ voltage ripple across the flying domain while the cascaded chain sinks $21.8 \mu \mathrm{~A}$ and 1 mA, respectively, with the extra $\sim 10 \mathrm{mV}$ under 1 mA load due to the brief non-overlap time, as shown in Fig. 3.21b. The voltage ripple can be reduced through a smaller hysteresis window. The average voltage across the flying domain drops by $\sim 35 \mathrm{mV}$ at 1 mA load, due to the internal $R_{\text {out }} \sim 35 \Omega$, as shown in Fig. 3.21b.

A table of comparisons is shown in Table 3.1.

### 3.6 Conclusion

A flying-domain power conversion concept has been presented that obviates the need for large flying passives by instead flying the load circuit itself. Experimental results from a 0.18 $\mu \mathrm{m}$ test chip reveal the proposed topology can achieve upwards of $11.8 \mathrm{~W} / \mathrm{mm}^{2}$ power density ( $2.3 \mathrm{~W} / \mathrm{mm}^{2}$ when including on-chip decoupling) at $91.7 \%$ efficiency, while achieving upwards of $99.2 \%$ peak efficiency at 0.37 mA when a large decoupling capacitor is used. The presented theory, models, and experimental results show that the FD power conversion concept can power practical digital loads with high efficiency and power density.

### 3.7 Acknowledgements

This chapter is based on and mostly a reprint of the following publications:

- L.G. Salem, J.G. Louie, and P.P. Mercier, "Flying-domain DC-DC power conversion," IEEE Journal of Solid-State Circuits (JSSC), Dec. 2016, vol. 51, no. 12, pp. 2830-2842.
- L.G. Salem, J.G. Louie, and P.P. Mercier, "A flying-domain DC-DC converter powering a Cortex-M0 processor with $90.8 \%$ efficiency," 2016 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, Feb. 2016, pp. 234-236.

Table 3.1: Table of comparisons to prior-art SC and resonant converters achieving high power density or efficiency

|  | [17] | [22] | [15] | [20] | [57] | [60] | This work |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Technology | 45 nm SOI | 130 nm | 22 nm Tri-Gate | 32 nm SOI | 180 nm HV | 65 nm | 180 nm SOI |
| Converter type | SC | SC | SC | SC | Resonant | SC | Flying Domain |
| Topology | 1/2 | 1, 2/3, 1/2, 1/3 | 1, 4/5, 2/3, 1/2 | 2/3, 1/2 | 1/2 | 1/2, 1/3, 1/4 | 1, 1/2, 1/4 |
| Flying capacitor type | Deep trench | Ferroelectric | MIM | Deep trench | MIM | MIM / MOS | N/A |
| Total flying capacitance [nF] | 2 | 8 | 1.6 | 1 | 24.6 | - | None |
| Vin [V] | 2 | 1.5 | 1.225 | 1.8 | 3 or 6 | 1.2 | 0.4-3 |
| Vout [V] | 0.95 | 0.4-1.1 | 0.45-1.05 | 0.7-1.1 | 1.85 or 3.7 | 0.25-1.2 | 0.2-1.5 |
| Power density of power stage [$\mathrm{W} / \mathrm{mm}^{2}$] | 2.2 | 0.003 | 0.5 | 3.71 | N.R. / Off-chip inductors | N.R. | 11.8 |
| Power density including output decoupling [$\mathrm{W} / \mathrm{mm}^{2}$] | N.R. / Off-chip decoupling | N.R. / Off-chip decoupling | N.R. / On-chip decoupling | 3.71 (multi-phase interleaving) | 0.91 | N.R. | 2.3 |
| Efficiency @ peak power density | 90\% | 92\% | N.R. | 90\% | 85.1\% | N.R. | 91.7\% |
| Peak efficiency | 90\% | 93\% | 84.2\% | 90\% | 89.1\% | 82\% | 99.2\% |
| Load current [mA] | 2.62 | 0.02-1 | 88 | 764 | 2081 | 0.002-0.6 | 0.37-3.7 |

N.R. = Not Reported

## Chapter 4

## A Switched-Capacitor Power Management Integrated Circuit

### 4.1 Introduction

Mobile electronic systems powered by Li-ion batteries typically employ a power management integrated circuit (PMIC) to step down the 2.8-4.2 V battery voltage. For their continuous regulation capability, switched-inductor architectures are often the preferred choice. In contrast, switched-capacitor (SC) PMICs utilize discrete capacitors that can have $7 \times$ lower BOM cost and an $8 \times$ smaller footprint than typical power inductors, yet are only efficient at discrete input-to-output ratios.

In this chapter, an SC PMIC is presented that achieves up to 6.6-bit resolution with only 5 flying capacitors for inductive PMIC replacement. The flying capacitors are reused in a frequency-scaled gear train as well as in charge-feedback SC topologies to attain a $2.4 \times$ reduction in the number of capacitors compared to prior art. In $0.25 \mu \mathrm{~m}$ bulk, the PMIC operates from an input voltage of $2.5-5 \mathrm{~V}$, can generate an output voltage ranging from $0.2-2 \mathrm{~V}$, and features an average efficiency of $90.2 \%$ across the entire range and a peak efficiency of $95.5 \%$.

### 4.2 Frequency-Scaled Gear-Train SC Topology

In conventional SC topologies, a bypass capacitor is required for charge balancing between the flying capacitors to enable a valid steady-state output. For example, when switching two cascaded cells in a 4:1 SC as illustrated in Fig. 4.1, capacitors C1 and C2 are stacked in $\Phi 1$ and $1 / 2$ of the charge is delivered to $V_{\text {out }}$. However, without a bypass capacitor, $C_{D C}$, between the two cells, C1 will keep charging until $V_{\text {out }}$ reaches 0. Another viable solution would be to have another $2: 1 \mathrm{SC}$ converter cell in parallel with the second cell, C2, operating out of phase. Unfortunately, both solutions require three capacitors. On the other hand, by switching the last cell, C2, at twice the frequency of the first cell, C1, the extra bypass or out-of-phase capacitor is eliminated. In the proposed gear train (GT) topology, also illustrated in Fig. 4.1, each pair of cascaded cells is operated as in a bucket brigade: the flying capacitance of each stage does not commence gathering new charge (i.e., is not switched to the next clock phase) until the flying capacitor of the following stage retrieves its charge. By eliminating $C_{D C}$, which does not in fact contribute to the charge shuttling, the proposed topology reduces the incurred charge-sharing loss for the same capacitance, e.g., by $2.75 \times$ in the Fig. 4.1 example. For illustration, the proposed GT SC requires only 5 capacitors, instead of a minimum of 7 in the highest-conversion-ratio topology from prior art [31], to reach a ratio of 32:1.

Figure 4.1: Proposed frequency-scaled gear-train switching scheme illustrating how to eliminate inter-stage decoupling capacitors for a $2 \times$ reduction in the required capacitors for binary ratios.

Fig. 4.2 shows simplified examples of the proposed binary GT and charge-feedback (CF) SC. To realize binary $m_{\text {odd }} / 2^{N}$ ratios, $N$ 2:1 SC cells are connected in cascade/stack and operated at binary-scaled frequencies, $f_{N-1} / 2^{N-i}$, where $i$ is the stage's order in the cascade. The proposed binary topology minimizes the inter-stage loading in the low-voltage train by maximizing the number of VINT0 and 0 connections. With 5 cascaded cells, a binary 5-bit resolution can be achieved. To further extend the achievable ratios, the ground charge of the 1st 2:1 cell is routed from the 2nd cell, enabling an additional $2^{N-1}$ CF ratios. In Fig. 4.2, up to 9 unique ratios exist in the lower $2 / 3$ of VBAT, for a total of 24 ratios that can be generated from 5 capacitors, which enables up to 6.6-bit ratio resolution that would otherwise require a minimum of 12 capacitors in a conventional binary SC [34, 56].
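The binary ratio set can be enumerated directly. The sketch below is only an illustration: the function name `binary_ratios` is hypothetical, and restricting the set to ratios below 1/2 is an assumption based on the boundary cell producing outputs in the lower half of VBAT. Under that assumption, 15 binary ratios remain, which together with the 9 CF ratios is consistent with the stated total of 24.

```python
from fractions import Fraction

def binary_ratios(n_cells):
    """Distinct m_odd / 2**n ratios for n = 1..n_cells, m_odd odd and below 2**n."""
    ratios = set()
    for n in range(1, n_cells + 1):
        for m_odd in range(1, 2 ** n, 2):
            ratios.add(Fraction(m_odd, 2 ** n))
    return sorted(ratios)

ratios = binary_ratios(5)
# Assumption: only ratios strictly below 1/2 are used by the boundary-cell design.
lower_half = [r for r in ratios if r < Fraction(1, 2)]
print(len(ratios), len(lower_half))
```
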

### 4.3 Circuit Implementation

The switch-level block diagram of the implemented SC PMIC is shown in Fig. 4.3. The SC enables a cascade/stack of 5 cells (schematics in Fig. 4.4) with feedback capability between C0 and C1. The first stage of the converter uses a boundary cell to handle the high battery voltages while enabling output voltages in the lower half of VBAT. An on-chip clock divider is used to generate the various binary-scaled clock phases for each stage. Instead of disabling the last cell in a cascade when realizing lower resolutions, backward cell-disabling is proposed such that the initial cells in the cascade are the ones disabled. For instance, to realize $m_{\text {odd }} / 16$ ratios, cell C1 is disabled, while its driving clock is bypassed to C0. Through backward disabling, the last cell, C4, becomes the critical cell that switches at the maximum frequency and handles the largest current, localizing the packaging ESL and ESR challenges to the single capacitor of the final stage. With the proposed binary frequency scaling, the cells' optimal relative conductance/capacitance sizing becomes almost identical, though with $2 \times$ for the first cell. The boundary cell is capable of handling voltages up to 5 V while utilizing 2.5 V thin-oxide devices through deep N-well stacking, and is driven through capacitive level shifters.

Figure 4.2: Examples of the proposed GT-binary and CF topologies. Overall, 24 distinct ratios can be achieved with 5 capacitors.

### 4.4 Measurement Results

The SC PMIC is fabricated in a $0.25 \mu \mathrm{~m}$ 5M bulk CMOS process (Fig. 4.8). Through the stacked voltage domains of the first stage, the chip can operate beyond the Lithium-ion battery range, achieving a total input range of $2.5-5 \mathrm{~V}$. At the same time, due to the 24 available ratios, a wide output range of $0.2-2.0 \mathrm{~V}$ is achievable with output power up to 186 mW. In this implementation, $4.7 \mu \mathrm{~F}$ and $2.2 \mu \mathrm{~F}$ SMD capacitors, each $1 \times 0.5 \times 0.6 \mathrm{~mm}^{3}$, are used for the first and remaining stages, respectively. Figures 4.5 and 4.6 outline the measured and modeled efficiency at various Lithium-ion voltages for DVS-modeling load resistors. The efficiency peaks at $95.5 \%$ and $91 \%$ and averages $91.7 \%$ and $84.4 \%$ over $0.5-1.8 \mathrm{~V}$ with $\mathrm{VIN}=3.6 \mathrm{~V}$ at $120 \Omega$ (Fig. 4.5) and $20 \Omega$ (Fig. 4.6), respectively. Further, the proposed CF topology maximizes the utilization of the 5 existing capacitors and realizes another 9 distinct ratios, 4 of which extend the $V_{\text {out }}$ range by up to $35 \%$ and enable operation at a low-battery voltage of 2.8 V to a $V_{\text {out }}=1.46 \mathrm{~V}$ with a $20 \Omega$ load. The CF ratios help to fill the gaps between the binary ratios, with a maximum improvement of $9.5 \%$. As shown in Figs. 4.6 and 4.7, the modeled and measured results match to within $1.5 \%$. Compared to a modeled 3-ratio series-parallel (SP) converter, a prominent topology used in present commercial charge pumps with 2 flying capacitors, the proposed PMIC with only 3 extra capacitors achieves efficiency improvements of up to $22.4 \%$ with a much wider operating range. Fig. 4.7 shows the PMIC operating at a constant load current, achieving greater than $80 \%$ efficiency down to $60 \mu \mathrm{~A}$. A table of comparison with the prior work is shown in Fig. 4.9.

Figure 4.3: Switch-level block diagram of the implemented converter and switching states of the implemented CF.

Figure 4.4: Implementation of boundary and transfer cells.

### 4.5 Acknowledgements

This chapter is based on and mostly a reprint of the following publication:
L.G. Salem and P.P. Mercier, "A battery-connected 24-ratio switched capacitor PMIC achieving 95.5\%-efficiency," 2015 IEEE Symposium on VLSI Circuits, Jun. 2015, pp. C340-C341.


Figure 4.5: Measured and modeled efficiency curves of the SC PMIC at 4.2V, 3.6V, and 2.8 V for a DVS-modeling load resistance of $120 \Omega$.


Figure 4.6: Measured and modeled efficiency curves of the SC PMIC at $4.2 \mathrm{~V}, 3.6 \mathrm{~V}$, and 2.8 V for a DVS-modeling load resistance of $20 \Omega$.


Figure 4.7: Efficiency of SC PMIC versus load current.


Figure 4.8: Die photo.

|  | V. Ng 2012 | H.-P. Le 2013 | S. Bandy. 2011 | This Work |
| :--- | :---: | :---: | :---: | :---: |
| Technology | $0.18 \mu \mathrm{~m}$ HV | 65 nm | 45 nm | $0.25 \mu \mathrm{~m}$ bulk |
| Input voltage | $7.5-13.5 \mathrm{~V}$ | $3-4 \mathrm{~V}$ | $2.8-4.2 \mathrm{~V}$ | $2.5-5 \mathrm{~V}$ |
| Output voltage | $1.5 \mathrm{~V}$ | $1 \mathrm{~V}$ | $0.4-1.2 \mathrm{~V}$ | $0.2-2 \mathrm{~V}$ |
| Passive type | $8 \times 10 \mu \mathrm{~F}$ | 4 integrated capacitors | $10 \mu \mathrm{H}$ inductor | $1 \times 4.7 \mu \mathrm{~F}, 4 \times 2.2 \mu \mathrm{~F}$ |
| Estimated passive vol. | $20 \times 1.35 \mathrm{~mm}^{3}$ | - | $9 \times 1.3 \mathrm{~mm}^{3}$ | $2.5 \times 0.6 \mathrm{~mm}^{3}$ |
| Estimated passive cost | $\$ 0.20$ | - | $\$ 0.27$ | $\$ 0.10$ |
| Core area | $11.6 \mathrm{~mm}^{2}$ | $0.64 \mathrm{~mm}^{2}$ | $2.25 \mathrm{~mm}^{2}$ | $3.47 \mathrm{~mm}^{2}$ |
| Voltage resolution | 7 ratios | 2 ratios | 5-bit DPWM | 6.6-bit / 24 ratios |
| Topology | Dickson | $2 / 5,1 / 3$ | Buck | Gear Train / Charge Feedback |
| Peak efficiency | $92 \%$ | $74.3 \%$ | $87.4 \%$ | $95.5 \%$ |
| Power handling | 1.5 W | 162 mW | 100 mW | 186 mW |

Figure 4.9: Comparison with prior work.

## Part II

# Miniaturizing DC-to-AC Power Conversion

## Chapter 5

## A Recursive House-of-Cards Power Amplifier

### 5.1 Introduction

Design of battery-connected power amplifiers (PAs) that simultaneously achieve high output power, efficiency, and linearity in scaled CMOS is challenging, in part due to the low ( $\sim 1 \mathrm{~V}$ ) breakdown voltage of thin-oxide transistors. Since most modern mobile systems utilize Li-ion batteries with voltages on the order of $\sim 4 \mathrm{~V}$, step-down conversion of the battery voltage via a DC-DC converter is typically required to safely drive scaled CMOS transistors. However, delivering $>20 \mathrm{dBm}$ of output power to a $50 \Omega$ antenna requires $>5 \mathrm{~V}$ peak-to-peak swing, and thus the large battery voltage must be stepped down to drive thin-oxide CMOS transistors that perform RF waveform amplification, after which the low-voltage RF waveform is transformed back up to a higher voltage via an impedance transformation network in order to drive $50 \Omega$ with sufficient power. Converting voltages down then back up leads to cascaded losses that, in practice, limit achievable battery-to-RF efficiency of CMOS PAs to $<30 \%$ [61]. While techniques like transistor stacking can help improve the voltage blocking capability of active PA elements [62, 63, 64], such techniques are often employed in conjunction with linear biasing strategies (e.g., class-B) that have inherently poor efficiency.
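The swing requirement quoted above can be checked with a few lines of arithmetic. This is a minimal sketch assuming a sinusoidal output delivered to a purely resistive load; the function names are illustrative only.

```python
import math

def dbm_to_watts(p_dbm):
    """Convert a power level in dBm to watts."""
    return 10 ** (p_dbm / 10) / 1000

def peak_to_peak_swing(p_dbm, r_load):
    """Peak-to-peak voltage of a sinusoid delivering p_dbm into r_load (P = V_peak^2 / (2 R))."""
    v_peak = math.sqrt(2 * dbm_to_watts(p_dbm) * r_load)
    return 2 * v_peak

v_pp = peak_to_peak_swing(20, 50)
print(round(v_pp, 2))
```

Under these assumptions, 20 dBm into $50 \Omega$ corresponds to roughly 6.3 V peak-to-peak, in line with the $>5 \mathrm{~V}$ figure in the text.
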

Switched-mode PAs such as class-D amplifiers can, on the other hand, theoretically achieve high efficiency, and, importantly, leverage CMOS scaling. Since switched-mode PAs in isolation do not support the amplitude modulation required by modern spectrally efficient communication standards, they are typically implemented in conjunction with techniques that impart amplitude modulation capabilities, such as envelope elimination and restoration (EER) [65, 66, 67], outphasing [68], or pulse-width modulation [69]. Alternative switching approaches such as direct RF digital-to-analog converters (DACs) [70], digital power combining [71, 72], and switched-capacitor (SC) PAs [73, 74] can be used, in some cases together with the aforementioned amplitude modulation techniques, to improve efficiency or linearity. However, while switched-mode PAs leverage transistor scaling, cascaded DC-DC converter, PA, and impedance transformation network losses are exacerbated as supply voltages scale downwards, making design of efficient, linear, battery-connected CMOS PAs increasingly challenging.

This chapter presents the design of a digital power amplifier that, by stacking and flying unit class-D PA cells in a switched-capacitor House of Cards (HoC) topology, enables efficient high output power generation with a direct Li-ion battery connection, all while using thin-oxide CMOS transistors. Since the PA is modular and recursively reconfigurable amongst several battery-to-RF voltage ratios, high battery-to-RF efficiency is achieved both at peak and 6 dB backoff power via a voltage-mode Doherty-like capacitive power combining technique.

Initial circuit schematics and measurement results were presented in [75]. In this chapter, the proposed HoC topology is introduced and contrasted to prior-art PA approaches in Section II, while Sections III and IV provide the implementation details. Section V presents detailed measurement results, and Section VI concludes the chapter.

### 5.2 House-of-Cards Switched-Capacitor Power Amplifier

As described in the previous section, the design of PAs in scaled CMOS processes to provide high output power with high efficiency poses multiple challenges related to low transistor breakdown voltages. In this section, we discuss the design evolution of the HoC PA topology to realize high output voltage swing while operating directly from a battery using scaled CMOS devices. A charge-recycling DC-DC conversion scheme is introduced to realize implicit high-efficiency DC-DC conversion directly from a high-voltage input power source. Then, the ladders of stacked PA cells are connected in cascade to provide efficient series power combining without any magnetics. Although described for CMOS processes, the proposed architecture is also suitable for other technologies such as BJTs, MESFETs, HEMTs, etc.

### 5.2.1 Implicit DC-DC Conversion via Stacked-Amplifier Charge-Recycling

Non-constant envelope modulation schemes (e.g. QPSK, QAM, OFDM, etc.) are required to provide better utilization of the available bandwidth in modern communication standards. Such high peak-to-average power ratio (PAPR) signals require a PA with high efficiency across a wide dynamic power range. Class-G supply modulation has been demonstrated to achieve high efficiency at back-off by operating a nonlinear PA from multiple supply voltage levels, typically $V_{i n}$ and $V_{i n} / 2$, as determined by the input envelope signal in an envelope elimination and restoration (EER) scheme [67]. The supply modulator can be implemented using a linear voltage regulator [76, 77, 66] or a hybrid design that includes a linear regulator in parallel with a switching supply modulator [78, 79]. In either case, the efficiency of the employed DC-DC converter is critical to the overall efficiency of the system.

Figure 5.1: Conventional class-G operation during a 6 dB back-off (a). Implicit $100 \%$ efficiency DC-DC conversion via charge-recycling PA stacking (b).

Figure 5.1a illustrates a nonlinear class-D PA when powered through a DC-DC converter to realize the 6 dB back-off operation in a class-G system. The modulator typically requires off-chip or large on-chip inductors and causes cascaded losses. Instead, high-efficiency implicit DC-DC downconversion can be realized by stacking two half-sized class-D PA cells, PA1 and PA2, each of half the PA total conductance, on top of each other while coupling their outputs through a flying capacitor $C_{f l y}$, and operating the stack from $V_{i n}=2 V_{D D}$, as shown in Fig. 5.1b. Each PA in the stack delivers half the total output power, $P_{\text {out }}=\frac{2}{\pi^{2}} \frac{V_{D D}^{2}}{R_{L}}$ [73], to the load $R_{L}$. The charge dumped by the top domain, $q=\int_{0}^{T / 2} \frac{I_{o}}{2} \sin (2 \pi t / T) d t=T I_{o} /(2 \pi)$, where $T$ is the RF carrier period and $2 q$ is the total output charge delivered during half the period, matches the charge absorbed by the bottom domain, so the intermediate node $V_{\text {int }}$ is automatically balanced to $V_{D D}$. In a practical implementation, a small $C_{f l y}$ matches the switching phases for PA1 and PA2 and establishes a 2:1 SC DC-DC converter by reusing the PA1 and PA2 switches to provide active regulation of $V_{\text {int }}$. Unlike the class-G DC-DC modulator, which has to provide the total PA output power, the established 2:1 SC DC-DC converter sources or sinks only a small delta current due to minimal charge imbalance between the stacked domains, PA1 and PA2.
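The closed-form half-cycle charge $q=T I_{o} /(2 \pi)$ can be cross-checked numerically. The following is a minimal sketch using a midpoint-rule integral; the peak current value is an arbitrary illustrative assumption, and the carrier frequency is taken as 0.72 GHz purely for concreteness.

```python
import math

def half_cycle_charge(i_o, period, steps=100_000):
    """Midpoint-rule integral of (i_o / 2) * sin(2*pi*t/T) over 0..T/2."""
    dt = (period / 2) / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * dt
        total += (i_o / 2) * math.sin(2 * math.pi * t / period) * dt
    return total

i_o = 0.2               # assumed peak output current in amperes (illustrative)
period = 1 / 0.72e9     # carrier period for an assumed 0.72 GHz carrier
q_numeric = half_cycle_charge(i_o, period)
q_closed = period * i_o / (2 * math.pi)   # closed form from the text
print(q_numeric, q_closed)
```
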

Figure 5.2 illustrates the switch-level block diagram and operation of the example 2-stack PA. The switches are controlled by the PM clock. Fig. 5.2b depicts the resulting networks during the phase when the PM clock is high $\left(\phi_{1}\right)$ and when the clock is low $\left(\phi_{2}\right)$. During $\phi_{1}$ the odd-numbered switches are turned on, connecting the flying capacitor, $C_{f l y}$, between the mid-level voltage, $V_{\text {int }}$, and ground. Consequently, capacitors $C_{f l y}$ and $C_{1}$ are connected in parallel and charge sharing occurs to balance the voltage across $C_{1}$ to $V_{i n} / 2$ at steady state. During $\phi_{1}$, $R_{L}$ is AC-coupled to $V_{\text {int }}$ and GND through the switches $s_{3}$ and $s_{1}$ in parallel, while $C_{f l y}$ holds a DC voltage of approximately $V_{D D}$. From Fig. 5.2b, during $\phi_{1}$ the top PA2 charges the intermediate node $V_{\text {int }}$ by a half sinusoid with amplitude $I_{o} / 2$. Therefore, $V_{\text {int }}$ jumps by $\Delta V \approx \frac{T I_{o}}{(2 \pi)\left(C_{1}+C_{2}+C_{f l y}\right)}$. In $\phi_{2}$, the even-numbered switches are on, connecting $C_{f l y}$ in parallel to $C_{2}$ in order to balance the voltage across $C_{2}$ to $V_{\text {in }} / 2$. At the same time, the AC-coupled $R_{L}$ is brought up to $V_{\text {in }}$ and $V_{\text {int }}$ through switches $s_{4}$ and $s_{2}$. In $\phi_{2}$, the charge $q=T I_{o} /(2 \pi)$ stored on the capacitors $C_{1}, C_{2}$, and $C_{f l y}$ during the prior phase is released back to supply PA1, and hence $V_{\text {int }}$ droops by $\Delta V$.

Alternating between the two phases $\phi_{1}$ and $\phi_{2}$, along with the boundary condition of continuous voltage across the capacitors $C_{1}, C_{2}$, and $C_{\text {fly }}$ during phase switching, forces all capacitor voltages and $V_{\text {int }}$ to reach $V_{\text {in }} / 2$ at steady state through the imposed KVL equations, irrespective of the initial voltage level [80]. The proposed topology thereby utilizes the switches to perform simultaneous power delivery at both the DC and the RF $f_{o}$ components.

Figure 5.2: An example 2-stack PA operation from $V_{i n}=2 V_{D D}$. (a) Switch-level block diagram. (b) The two resulting switched networks of the HoC PA when the PM clock is high and low. (c) Differential operation for elimination of $V_{\text {int }}$ capacitance.

The size of the capacitors $C_{1}, C_{2}$, and $C_{f l y}$ determines the amount of voltage ripple, $\Delta V$, on $V_{\text {int }}$. For $10 \%$ ripple, $C_{1}, C_{2}$, and $C_{f l y}$ should be assigned equal sizes, i.e. one third each, of a total on-chip capacitance of $10 \times T I_{o} /\left(2 \pi V_{D D}\right)$. In order to reduce the amount of required capacitance, an AC virtual ground is created at $V_{\text {int }}$ in Fig. 5.2c by tying together the $V_{\text {int }}$ nodes of two 2-stack PAs and driving them in opposite phases. Through the established differential operation, the current dumped by PA2 - into $V_{\text {int }}$ cancels the current drawn by PA1 during $\phi_{1}$, and vice versa in $\phi_{2}$, and hence the required total capacitance for DC balancing is nearly zero. Practically, $C_{1}$ and $C_{2}$ should still be large enough to decouple the required gate drive charge, i.e. $C_{1}=C_{2} \approx 10 \times C_{G}$, where $C_{G}$ is the total gate capacitance of PA1 or PA2. This decoupling capacitance is typically implemented using thin-oxide MOS gate capacitance. Unlike the power switch, which is typically implemented using multiple parallel fingers with large area overhead for drain and source regions, the MOS capacitor can be implemented using a single transistor finger of almost equal width and length, and therefore in a denser manner. The parasitic top/bottom capacitors of the required decoupling capacitance are at a fixed voltage level relative to ground, and therefore do not result in parasitic switching losses. On the other hand, $C_{f l y}$ should be set such that $1 /\left(\omega_{o} C_{f l y}\right)<2 R_{\text {on }}$, where $R_{o n}$ is the total equivalent output resistance $R_{\text {out }}$ of the PA, for phase-aligned AC operation.
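The two sizing rules in this paragraph (total capacitance for a given ripple in the non-differential case, and the lower bound on $C_{f l y}$) can be sketched numerically. All component values below are illustrative assumptions, not values from the implemented design, and the function names are hypothetical.

```python
import math

def total_cap_for_ripple(i_o, f_o, v_dd, ripple_fraction):
    """Total C1 + C2 + Cfly such that dV = T*I_o / (2*pi*C_total) equals ripple_fraction * V_DD."""
    period = 1 / f_o
    return period * i_o / (2 * math.pi * ripple_fraction * v_dd)

def cfly_boundary(f_o, r_on):
    """Cfly value at which 1/(w_o*Cfly) equals 2*R_on; Cfly should exceed this."""
    return 1 / (2 * math.pi * f_o * 2 * r_on)

# Assumed illustrative operating point.
i_o, f_o, v_dd, r_on = 0.2, 0.72e9, 1.2, 1.0
c_total = total_cap_for_ripple(i_o, f_o, v_dd, ripple_fraction=0.10)
c_each = c_total / 3                  # equal split between C1, C2, and Cfly
c_fly_min = cfly_boundary(f_o, r_on)
print(c_total, c_each, c_fly_min)
```

For a 10% ripple the first function reduces to the $10 \times T I_{o} /\left(2 \pi V_{D D}\right)$ expression quoted in the text.
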

The 2-stack differential PA topology provides three advantages for scaled CMOS technologies as compared to the representative class-G system when operating at 6 dB back-off as illustrated in Fig. 5.1a. First, the proposed differential topology provides the required supply, $V_{i n t}=V_{D D}$, for 6 dB back-off without any extra DC-DC converter. Even in a linear supply modulator, a large output decoupling capacitor $C_{d}$ is required to enable a stable control loop with enough phase margin [81]. The stacking topology also enables powering the PA cells from a $2 V_{D D}$ input without violating the breakdown voltage of the employed thin-oxide switches ${ }^{11}$. In addition, the stacked PA does not suffer from cascaded losses from a DC-DC converter in series with a PA as in conventional battery-connected CMOS PA approaches (e.g., the efficiency in Fig. 5.1a is $\eta=\eta_{D C-D C} \eta_{D C-A C}$). Instead, the efficiency of the 2-stack PA becomes $\eta=\left(1+R_{o n} / R_{L}\right)^{-1}$, which approaches $100 \%$, instead of $\eta$ being bounded by $50 \%$ when a linear supply modulator is used. Secondly, the implicit high-efficiency switching DC-DC conversion implemented through stacking the two PA slices does not produce spurious output noise, even with the inherent 2:1 SC, because it operates at the carrier frequency $f_{o}$. On the other hand, most PAs operated from explicit DC-DC converters produce spurs at the fundamental switching frequency of the supply modulator and its harmonics. Passive filtering and post-regulation through cascaded high power-supply-rejection-ratio (PSRR) linear regulators are required to prevent the switching products from reaching the PA output, increasing cost and reducing efficiency of the overall class-G solution. Finally, by stacking 2 PA cells, the current drawn from the supply and ground grids is also reduced by a factor of 2 over the current drawn when PA cells are operated in parallel. This reduces the off-chip supply decoupling tree size and $I^{2}$ ESR losses by 2 and 4 times, respectively.
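The efficiency and supply-current arguments above can be illustrated with a short sketch. The resistance values are arbitrary assumptions, the function names are hypothetical, and extending the current-scaling argument from 2 to a general stack height `n_stack` is also an assumption made here for illustration.

```python
def stacked_efficiency(r_on, r_load):
    """Efficiency of the 2-stack PA: eta = (1 + R_on / R_L)^-1."""
    return 1 / (1 + r_on / r_load)

def cascaded_efficiency(eta_dcdc, eta_dcac):
    """Efficiency of a DC-DC converter in series with a PA."""
    return eta_dcdc * eta_dcac

def esr_loss_ratio(n_stack):
    """I^2 * ESR loss relative to parallel operation, assuming supply current scales as 1/n_stack."""
    return (1 / n_stack) ** 2

r_on, r_load = 1.0, 10.0                      # assumed illustrative values
eta_stack = stacked_efficiency(r_on, r_load)
# A linear regulator stepping 2*V_DD down to V_DD is at best 50% efficient.
eta_linear = cascaded_efficiency(0.5, eta_stack)
print(eta_stack, eta_linear, esr_loss_ratio(2))
```
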

The proposed implicit charge-recycling DC-DC conversion can be generalized to realize a $(2 / \pi) V_{D D}$ output voltage amplitude from $V_{i n}=N V_{D D}$ using $V_{D D}$-rated thin-oxide devices. Instead of stepping the input battery voltage $V_{\text {in }}$ down by $N: 1$ through a lossy and bulky DC-DC converter, the proposed approach, as illustrated in Fig. 5.3, slices a nominal PA, with conductance $G_{\text {on }}$ for a given output current driving capability $i_{o}$, into $N$ PA cells, each with conductance $G_{o n} / N$. Then, the approach stacks the $N$ PA cells, each operating at the nominal process voltage $V_{D D}$ while the entire stack is powered from $N V_{D D}$, such that the charge discarded by the top-most PA cell trickles down through the $N$-PA stack to be recycled at each level, achieving $\sim 100 \%$ DC-DC efficiency. The $N-1$ flying-capacitor ladder is employed to enforce phase alignment among the stacked $N$ PA cells and establish a properly-posed topology.

[^2]

Figure 5.3: Implicit DC-DC conversion via charge-recycling. The same total DC capacitance, i.e. $C_{1}+C_{2}$ in Fig. 5.2, can be equally divided between the $N-1$ intermediate nodes for the same ripple amplitude as the 2-stack PA, in the case of non-differential operation; the same applies to $C_{f l y}$.

### 5.2.2 High-Voltage RF Signal Generation in Scaled CMOS without Magnetics

The low breakdown voltages available with deep submicron devices limit the allowed output voltage swings, and hence the achievable peak output power across a fixed load. To supply the high output power levels required by modern communication standards with thin-oxide devices, contemporary PA design schemes resort to impedance transformation networks [82, 83, 84, 85], power combining [71, 86, 72, 61], device stacking [87, 88, 62, 64], or an amalgamation of them. Figure 5.4 illustrates representative power combining and device stacking schemes, respectively. The parallel power combining scheme shown in Fig. 5.4a (left) achieves an output voltage amplitude of $(4 / \pi) V_{D D}$ from $V_{i n}=2 V_{D D}$ by placing two PA cells in parallel, powering them from a DC-DC converter outputting $\sim V_{D D}$, and having each amplifier feed a 1:1 transformer whose secondary winding is connected in series with the other PA secondary, achieving 1:2 impedance transformation. Unfortunately, on-chip transformers consume significant silicon area and suffer from high losses due to the low-resistivity substrate and the thin metal and dielectric layers in baseline CMOS [89, 85, 72].

Figure 5.4: Prior schemes to realize high RF power using thin-oxide devices. (a) Parallel and series power combining schemes. (b) Digital device-stacked PAs.

An alternative means to generate high output power using thin-oxide devices is to perform series power combining by transistor stacking. Essentially, such schemes DC-connect transistors in series, while engineering the RF swing at the gate of each stacked transistor on top of the input common-source to ensure that all the transistors are operating in the safe region (i.e., without exceeding $V_{G S}$ or $V_{D S} / V_{D G}$ ratings), while producing high output amplitude (Fig. 5.4a, right), as in stacked-FET PA [64, 63, 62] and high-voltage/high-power (HiVP) [87, 90]. Unfortunately, such schemes typically require an array of input and output matching networks that occupy large area, degrade the overall PA efficiency, and generally limit the operational bandwidth. More importantly, such schemes require post-fabrication trimming of the gate networks to ensure proper in-phase voltage swing at the gate and drain to avoid device breakdown. While differential topologies can create a virtual AC ground at the gate bias points, most demonstrated prototypes in prior art assume up to $N$ external, ideal bias sources using lab equipment, while generating the required low-impedance high-fidelity bias voltage on chip, as in real products, is extremely challenging [91].

The stacked-transistor concept can be extended to a digital cascoded structure as illustrated in Fig. 5.4b. Stacking two devices practically requires an extra high-fidelity DC-DC source, as shown in Fig. 5.4b (left). Stacking more devices is challenging. For example, the upper two switches in a four-stack cascode (Fig. 5.4b, right) have to be operated from level-shifted $180^{\circ}$-phase square PM signals $\phi_{h}$ and $\phi_{2 h}$ with $V_{D D}$ and $3 V_{D D}$ swings, respectively, in order not to violate the transistor oxide breakdown during the on-phase while not exceeding a $V_{D S}$ of $V_{D D}$ in the off-phase. This requires perfect alignment, given the finite rise/fall times, between the PM input $\phi$ and the level-shifted and out-of-phase PM clocks $\phi_{h}$ and $\phi_{2 h}$, not to mention the complexity of generating a $3 V_{D D}$ swing drive signal. Stacking $N$ devices to realize an $N V_{D D}$-rated cascoded switch requires $N-1$ high-fidelity biasing levels at $n V_{D D}$ with progressively increasing gate swings, which significantly increases the cost of the solution.

To generate high RF voltages using only scaled thin-oxide CMOS transistors, a House-of-Cards topology is proposed that builds upon the stacked class-D concept presented earlier. An example 2-stack HoC is illustrated in Fig. 5.5. To achieve an amplitude of $(4 / \pi) V_{D D}$, the supply and GND of a third PA cell, PA3, are switched with respect to the power source $V_{\text {in }}$ rails through the PA2 and PA1 switches, respectively, to provide voltage addition of the initial (PA1, PA2) ladder and PA3 outputs. The proposed topology essentially arranges the comprising PA cells in a HoC topology, where PA3 acts as a "flying-domain" [92]. During $\phi_{1}\left(\phi_{2}\right)$, in Fig. 5.5b, the odd (even) numbered switches are on, and hence $R_{L}$ is connected to GND $\left(V_{i n}\right)$.

The equivalent output resistance, $R_{\text {out }}$, of such a PA is $2 R_{o n}$, where $R_{o n}$ is the on-resistance of the switches $s_{1}, s_{5}, s_{4}$, and $s_{6}$. The charge delivered to the output load $R_{L}$ does not pass through the inner switches $s_{2}$ and $s_{3}$, which are only used to balance capacitors $C_{1}$ and $C_{2}$ using $C_{f l y}$ during transients. As a result, switches $s_{2}$ and $s_{3}$ can be made of minimal size. Essentially, switches $\left(s_{1}, s_{4}\right)$ and $\left(s_{5}, s_{6}\right)$ form two class-D PA cells connected in cascade, where both handle the total output current, $i_{o}$, and are therefore termed AC PA cells. Through $\left(s_{2}, s_{3}\right)$, i.e. the DC PA cell, $v_{1}$ and $v_{2}$ are never left floating, unlike what would occur in a conventional cascoded switcher, and hence the proposed HoC guarantees reliable operation without exceeding any device voltage rating, all in a self-contained solution without any bias circuitry or added complexity.

Figure 5.5: An example 2-stack HoC PA (a). Resulting phases when $\phi$ is high and low (b).

The proposed HoC topology can be generalized to simultaneously realize a $1: N$ voltage step-up ratio and $N$ PA power combining. The proposed topology realizes $N V_{D D}$ swing from $N$ AC $V_{D D}$-rated PA cells by flying an entire $(N-i)$-stacked-PA ladder through the switches of a prior $(N-i+1)$-stacked-PA with the $(N-i)$-ladder input gates clamped to the intermediate nodes of the prior ladder, recursively, until the resulting lower-stack PA is a single PA cell, as shown in Fig. 5.6a. The commutation of the switches permits the addition of the voltage swings of the $N$ AC PA cells (i.e. voltage-domain combining). In addition, each flying PA-ladder provides automatic DC voltage balancing of the stacked domains of the prior ladder. Each clamping capacitor, DC or flying, is automatically balanced to $\sim V_{D D}$ at steady state. With the proposed gate connection, the cascaded PA-ladders in Fig. 5.6a are switched in a domino falling fashion with the annotated transient states of the intermediate nodes. As a result, the voltage swing at the gate and drain of each switch is perfectly aligned through the clamping capacitors to guarantee safe operation in a robust digital manner (Fig. 5.6b).

Figure 5.6: An example 3-stack HoC PA (a). Fundamental AC PA cells and their gate/drain voltages to enable aligned safe operation through the clamping capacitors while performing series power combining (b).


Figure 5.7: (a) Block diagram of a single-ended HoC; the actual implementation is differential. (b) Top-level schematic of one of the 16 HoC slices; the actual slice is differential.

### 5.3 Recursive HoC Amplifier Architecture

### 5.3.1 HoC Digital Power Amplifier Linearization

Among various digital linearization techniques [69, 68, 66, 67, 70, 93], the SC PA architecture [73, 74] has superior efficiency and linearity. A similar approach is followed in this work to linearize the non-linear HoC PA. Figure 5.7a illustrates the block diagram of the implemented 2-stack by 2-cascade recursive HoC power amplifier powered directly at $V_{\text {in }}=4.8 \mathrm{~V}$ using $V_{D D}=1.2 \mathrm{~V}$ thin-oxide transistors in 65 nm . The input baseband signal is oversampled and raised-cosine filtered using a DSP to generate the in-phase (I) and quadrature (Q) signals. Using a CORDIC algorithm, the digital I and Q signals are converted into a 5-bit envelope ( $A[4: 0]$ ) and phase $(\phi)$ component. A square carrier at $f_{o}$ is phase modulated by the produced phase signal through a mixer.

The generated PM clock is used to drive 16 PA slices, each sized to have conductance $G_{o n} / 16$, and each implementing a 2-stack 2-cascade HoC PA. As shown in Fig. 5.7b, six $V_{D D}$-rated class-D PA cells are used to implement each HoC slice. Three class-D cells are arranged in a 2-cascade HoC topology to establish two $2 V_{D D}$ swing PAs, HoC1 and HoC2, which are then stacked on top of each other to block a $V_{i n}$ of $4 V_{D D}$. The 16 PA slices share the same intermediate DC nodes ( $V_{\text {int } 1}, V_{\text {int } 2}$, and $V_{\text {int } 3}$ ), while the output of each of the 32 2-cascade HoC PAs (i.e., 16 slices, two 2-cascade HoC PAs each) is coupled to the output $V_{\text {out }}$ through a $C_{c} / 32$ capacitor.

Based on the required envelope amplitude, the 16 HoC slices are selectively enabled through the thermometer decoder to switch the bottom plates of a unary-sized MIM capacitor array, whose total capacitance is $C_{c}=25 \mathrm{pF}$, at $f_{o}$ and a voltage swing of $2 V_{D D}$. The bit $A[4]$ is employed to set the gain value of each HoC slice to one of two possible values, as will be discussed later in this section. At peak power, all HoC slices switch, while slices are gradually deactivated at back-off by connecting the bottom plates of their $C_{c} / 32$ capacitors to GND and $2 V_{D D}$. A decoder is used to activate each of the HoC slices according to the envelope code, which is fetched at the desired sample rate to reconstruct the non-constant envelope RF output. An output inductive band-pass filter is used to resonate with $C_{c}$ and establish an LC impedance transformation network. As shown in Fig. 5.7a, a two-stage LC impedance transformation network is employed in total to transform the $25 \Omega$ load resistance ( $50 \Omega$ antenna through a balun) to $10 \Omega$ (i.e. an impedance transformation of $1: 2.5$ ) to generate the desired 23 dBm total output power. Thus, each LC matching stage should provide an impedance transformation ratio of $\sqrt{2.5}$ for maximum bandwidth. However, the first LC stage is designed to provide an impedance transformation ratio of $\sim 1.8$, which is a little larger than $\sqrt{2.5}$, for lower charge-sharing loss while maintaining reasonable bandwidth, as will be discussed. Therefore, the required loaded quality factor $Q_{l}$ of the first LC matching stage is $\sim 0.88$, which sets the value of $C_{c}$ to 25 pF at 0.72 GHz for the white-space mobile market.
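The quoted $Q_{l}$ and $C_{c}$ values can be reproduced approximately with textbook matching relations. The sketch below assumes the standard single L-section relation $Q=\sqrt{m-1}$ for a transformation ratio $m$, and $Q_{l}=1 /\left(\omega_{o} C_{c} R\right)$ with $R=10 \Omega$ taken as the resistance seen by the capacitor array; neither assumption is stated explicitly in the text, so this is a plausibility check rather than the design procedure actually used.

```python
import math

def l_match_q(ratio):
    """Loaded Q of a single L-section providing the given resistance ratio (ratio > 1)."""
    return math.sqrt(ratio - 1)

def series_cap_for_q(q_loaded, f_o, r_series):
    """Series capacitance giving Q = 1 / (w_o * C * R) at frequency f_o."""
    return 1 / (2 * math.pi * f_o * q_loaded * r_series)

equal_split = math.sqrt(25 / 10)      # per-stage ratio for an equal two-stage split
q_first = l_match_q(1.8)              # first stage designed for a ratio of about 1.8
c_c = series_cap_for_q(q_first, f_o=0.72e9, r_series=10.0)
print(equal_split, q_first, c_c)
```

With a ratio of exactly 1.8 this gives a loaded Q close to 0.89 and a capacitance just under 25 pF, in reasonable agreement with the $\sim 0.88$ and 25 pF values quoted above.
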

Modulating the output amplitude by controlling the number of actively switching PA slices essentially resembles a controllable capacitive voltage divider applied to a constant-envelope $2 V_{D D}$ square wave, as shown in Fig. 5.8. Here, $K$ is the number of unary-sized slices that enable a $\log _{2}(K)$-bit envelope resolution. As the envelope code, $i$, is increased, more capacitors are switched between GND and $2 V_{D D}$ through an $i G_{o n} / K$ conductive path, while $K-i$ capacitors are statically pulled down through a $(K-i) G_{o n} / K$ path. Since the input port of the employed matching network is inductive, the matching network can be considered high-impedance during the fast transitions of the input square signal. Therefore, the output voltage $V_{\text {out }}$ is determined by the voltage divider in Fig. 5.8 as:

Figure 5.8: Equivalent circuit of the implemented Switched-Capacitor HoC PA.

$$
\begin{equation*}
V_{o u t}=\frac{2}{\pi} \frac{i}{K} 2 \times V_{D D} \tag{5.1}
\end{equation*}
$$

As readily seen, the series combination of the capacitor array, $i(K-i) / K^{2} C_{c}$, must be charged and discharged once per RF cycle; thus the array charge-sharing loss $P_{c s}$ is:

$$
\begin{equation*}
P_{c s}=\frac{i(K-i)}{K^{2}} C_{c}\left(2 \times V_{D D}\right)^{2} f_{o} . \tag{5.2}
\end{equation*}
$$

By employing a series inductive reactance, the series capacitive reactance $1 /\left(\omega_{o} C_{c}\right)$ can be cancelled at $f_{o}$, thereby enabling a significant reduction in the employed $C_{c}$ to an extent dependent on the unloaded quality factor of the employed inductor, $Q_{\text {ind }}$. Through a larger inductance $L$ at a given $R_{L}$, a higher loaded quality factor $Q_{l}=\omega_{o} L / R_{L}=1 /\left(\omega_{o} C_{c} R_{L}\right)$ can be realized. This results in a smaller array capacitance, and hence $P_{c s}$ can be reduced, as demonstrated in [73]:

$$
\begin{equation*}
\eta_{\text {drain }}=\left(1+\frac{\pi}{4} \frac{(K-i)}{i} \frac{1}{Q_{l}}\right)^{-1} . \tag{5.3}
\end{equation*}
$$
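As a consistency check, (5.3) can be recovered from (5.1) and (5.2) under two assumptions that are not stated explicitly above: the output power is that of a sinusoid of amplitude $V_{\text {out }}$ across $R_{L}$, i.e. $P_{\text {out }}=V_{\text {out }}^{2} /\left(2 R_{L}\right)$, and the charge-sharing loss is the only loss mechanism. With $Q_{l}=1 /\left(\omega_{o} C_{c} R_{L}\right)$ and $\omega_{o}=2 \pi f_{o}$:

$$
\begin{equation*}
\frac{P_{c s}}{P_{\text {out }}}=\frac{\frac{i(K-i)}{K^{2}} C_{c}\left(2 V_{D D}\right)^{2} f_{o}}{\frac{1}{2 R_{L}}\left(\frac{2}{\pi} \frac{i}{K} 2 V_{D D}\right)^{2}}=\frac{\pi^{2}}{2} \frac{(K-i)}{i} f_{o} C_{c} R_{L}=\frac{\pi}{4} \frac{(K-i)}{i} \frac{1}{Q_{l}}
\end{equation*}
$$

so that $\eta_{\text {drain }}=P_{\text {out }} /\left(P_{\text {out }}+P_{c s}\right)=\left(1+P_{c s} / P_{\text {out }}\right)^{-1}$ reproduces the expression in (5.3).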

Figure 5.9 shows the drain efficiency of an SC PA with an array size of 32 unit capacitors.


Figure 5.9: Drain efficiency of a 5-bit SC PA and a class-G-like HoC. Comparison of the drain efficiency of the proposed Doherty-like HoC (5.5) and a conventional 5-bit SC PA with two supplies $V_{D D}$ and $2 V_{D D}$ (5.6), all at $Q_{l}$ of 0.5.

As shown, with a reasonable on-chip $Q_{\text {ind }}$ of $10-15$, the SC PA efficiency falls by $60 \%$ at 6 dB back-off. Techniques are thus required to enhance efficiency at back-off.
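The back-off penalty can be illustrated by evaluating (5.3) directly. The sketch below is only an evaluation of the ideal expression, assuming $K=32$ and the $Q_{l}=0.5$ used for Fig. 5.9; it ignores the finite $Q_{\text {ind }}$ and all switch losses, and the function name is hypothetical.

```python
import math

def sc_pa_drain_efficiency(i: int, k: int, q_loaded: float) -> float:
    """Ideal SC PA drain efficiency per Eq. (5.3); defined as 0 at i = 0."""
    if i == 0:
        return 0.0
    return 1.0 / (1.0 + (math.pi / 4.0) * (k - i) / i / q_loaded)

K = 32      # unit capacitors, as in Fig. 5.9
Q_L = 0.5   # loaded quality factor assumed for Fig. 5.9

peak = sc_pa_drain_efficiency(K, K, Q_L)
backoff_6db = sc_pa_drain_efficiency(K // 2, K, Q_L)  # half amplitude

print(f"peak efficiency:        {peak:.3f}")
print(f"6 dB back-off:          {backoff_6db:.3f}")
print(f"relative drop:          {1.0 - backoff_6db / peak:.1%}")
```

With these inputs the ideal expression alone already accounts for a drop of roughly 60% at half amplitude, in line with the figure quoted above.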

### 5.3.2 Voltage-Mode Magnetic-less Swapping Doherty for High Average Efficiency

To realize high efficiency at back-off in a fully-integrated, magnetic-less, and reconfigurable approach, the proposed PA can reconfigure each PA slice, containing two stacked HoC cells (HoC1 and HoC2 in Fig. 5.10a (left)) with $2 V_{D D}$ output swings, into a stack of four class-D PA cells whose outputs are capacitively coupled to provide $V_{D D}$ output swings. In this manner, the charge-sharing losses of the capacitor array, $P_{c s}$ in (5.2), can be scaled by the same factor as the output power at 6 dB back-off (i.e., four times), and hence the HoC PA realizes a second efficiency peak at 6 dB back-off that matches the peak $P_{\text {out }}$ efficiency, as illustrated in Fig. 5.9 (class-G-like HoC). Since the overall PA supply voltage, $V_{i n}=4 V_{D D}$, is not changed, the reconfigurable HoC amplifier can be considered a solid-state RF impedance transformer that achieves two voltage transformation ratios, $1: 2$ and $1: 1$, as in Fig. 5.10b. The two available transformation ratios boost the achievable resolution by one bit, and thus $0 \leq i \leq 2 K$.

Figure 5.10: Reconfiguring the HoC slice transformation ratio from $1: 2$ to $1: 1$ to achieve high efficiency at back-off (a). Simplified equivalent circuit (b).

The MSB of the envelope code, $A[4]$ in Fig. 5.7a, is used to set the transformer ratio. The remaining four least significant bits, $A[3: 0]$, are used to enable fine-grain amplitude resolution. There are two ways to accomplish fine-grain amplitude modulation, Class-G-like and Doherty-like, which differ in how they utilize the inactive slices.

## Class-G-like HoC back-off

At the 1:2 transformation ratio (i.e., $A[4]=1$ ), $A[3: 0]$ can be employed through the decoder in Fig. 5.7a to adapt the number of actively switching slices with $2 V_{D D}$ swings, $i-K$, while the remaining $2 K-i$ slices (where $K \leq i \leq 2 K$ ) are statically connected low. In this case, the drain efficiency is similar to (5.3) but replacing $K$ with $2 K$. As shown in Fig. 5.9 ("Class-G-like HoC"), the HoC suffers from a discontinuous efficiency profile near the transition point in between the two transformation ratios, since all the capacitors $C_{c}$ in the HoC array are switched through $V_{D D}$ input voltage swing to produce the -6 dB back-off amplitude value and therefore the PA efficiency jumps suddenly at the -6 dB code to the ideal $100 \%$ value. This resembles the operation of a conventional class-G PA that operates through a $100 \%$-efficient DC-DC converter to produce the $-6 \mathrm{~dB} V_{i n} / 2$ supply.

## Doherty-like HoC back-off

To improve efficiency at back-off, when $A[4]=1$ the $2 K-i$ inactive slices instead switch the bottom-plate of their coupling capacitors with a swing of $V_{D D}$ rather than being static. Essentially, the input signal is amplified through two voltage-mode power amplifier paths, a main amplifier path with $V_{D D}$-swing and a peaking amplifier path with $2 V_{D D}$-swing, as shown in Fig. 5.11a. The two paths are simply combined through a programmable capacitive voltage-divider network to generate amplitudes between $V_{D D}$ and $2 V_{D D}$, according to $A[3: 0]$, and hence the output voltage becomes

$$
\begin{equation*}
V_{o}=\frac{2}{\pi}\left(\frac{i-K}{K} 2 \times V_{D D}+\frac{(2 K-i)}{K} V_{D D}\right) \tag{5.4}
\end{equation*}
$$

for $K \leq i \leq 2 K$. This way, the $K$-capacitor array is charged and discharged through only the amplitude difference between the two amplifiers, i.e. $V_{D D}$, instead of $2 V_{D D}$ as in the Class-G-like coding, reducing the charge-sharing losses by $4 \times$ and enhancing the efficiency profile between the two ratios to exactly follow a Doherty back-off profile.
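Collecting terms in (5.4) makes the linearity of this coding explicit. The second line below is an inferred analogue of (5.2), assuming the series combination of the $i-K$ peaking capacitors and the $2 K-i$ main capacitors is charged through the $V_{D D}$ difference once per RF cycle:

$$
\begin{equation*}
V_{o}=\frac{2}{\pi} \frac{V_{D D}}{K}(2(i-K)+(2 K-i))=\frac{2}{\pi} \frac{i}{K} V_{D D}, \quad K \leq i \leq 2 K
\end{equation*}
$$

$$
\begin{equation*}
P_{c s}=\frac{(i-K)(2 K-i)}{K^{2}} C_{c} V_{D D}^{2} f_{o}
\end{equation*}
$$

The output amplitude therefore rises linearly from $(2 / \pi) V_{D D}$ at $i=K$ to $(2 / \pi) 2 V_{D D}$ at $i=2 K$, while the charge-sharing loss vanishes at both end points.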


Figure 5.11: Equivalent circuit of the HoC while generating amplitudes between the two transformation ratios (a). Load-pull characteristics of the HoC for $K \leq i \leq 2 K$ (b). Normalized voltages and admittances of the main and peaking amplifiers (c). Swapping Doherty illustration (d).

The operation, to a great extent, is similar to the 2-way Doherty configuration, where capacitive load-pull of the main amplifier occurs. Rather than treating the two amplifiers in the 2-way Doherty as current sources, the main and peaking amplifiers are employed as voltage sources of different amplitude levels $V_{M}=(2 K-i) / K \times V_{D D}$ and $V_{P}=(i-K) / K \times 2 V_{D D}$, respectively. In the proposed voltage-domain combining, the load admittance, rather than the impedance as in a current-mode Doherty, is gradually lowered once the peaking amplifier is on, as in Fig. 5.11b and Fig. 5.11c. Unlike the classical Doherty implementations that disable the peaking amplifier at back-off, wasting silicon area, a "swapping Doherty" architecture is used where at back-off the peaking amplifier slices are reconfigured (i.e. swapped) to act as the main amplifier, realizing $100 \%$ resource utilization, as in Fig. 5.11d. The efficiency under such operation becomes

$$
\begin{equation*}
\eta_{\text {drain } \mid \text { Doherty }}=\left(1+\frac{\pi}{4} \frac{(i-K)(2 K-i)}{i^{2}} \frac{1}{Q_{l}}\right)^{-1} \tag{5.5}
\end{equation*}
$$

for $K \leq i \leq 2 K$. To the best of our knowledge, the continuous efficiency transition through the second amplitude coding scheme was first noted in [74] in a class-G SC PA.
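The difference between the two coding schemes over $K \leq i \leq 2 K$ can be tabulated with a small sketch. It evaluates (5.5) against (5.3) with $K$ replaced by $2 K$, which is the Class-G-like expression as described in the text; $K=16$ and $Q_{l}=0.5$ are assumed values, and the function names are illustrative.

```python
import math

def eta_class_g_like(i: int, k: int, q_loaded: float) -> float:
    """Eq. (5.3) with K replaced by 2K, intended for K < i <= 2K."""
    return 1.0 / (1.0 + (math.pi / 4.0) * (2 * k - i) / i / q_loaded)

def eta_doherty_like(i: int, k: int, q_loaded: float) -> float:
    """Eq. (5.5), valid for K <= i <= 2K."""
    loss = (math.pi / 4.0) * (i - k) * (2 * k - i) / i**2 / q_loaded
    return 1.0 / (1.0 + loss)

K = 16
Q_L = 0.5

for code in range(K, 2 * K + 1, 4):
    print(code,
          f"class-G-like: {eta_class_g_like(code, K, Q_L):.3f}",
          f"Doherty-like: {eta_doherty_like(code, K, Q_L):.3f}")
```

The Doherty-like loss term is the Class-G-like term scaled by $(i-K) / i$, so it is never larger on this interval and vanishes at both $i=K$ and $i=2 K$, which is what removes the discontinuity at the transition code.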

It is important to note that the conventional class-G PA with multiple supplies cannot achieve the efficiency profile of the Doherty configuration, even with the discussed Doherty amplitude coding. This is because the secondary efficiency peak at 6 dB back-off is reduced by the cascaded losses of the back-off DC-DC converter. By adding the normalized loss incurred for supplying the power of the main PA, the efficiency of such an approach can be given by:

$$
\begin{equation*}
\eta_{\text {drain }}=\left(1+\frac{K(2 K-i)}{i^{2}}\left(\frac{1}{\eta_{D C-D C}}-1\right)+\frac{\pi}{4} \frac{(i-K)(2 K-i)}{i^{2}} \frac{1}{Q_{l}}\right)^{-1} \tag{5.6}
\end{equation*}
$$

On the other hand, through the implicit $100 \%$ DC-DC conversion, the HoC topology can realize the exact 2-way Doherty efficiency profile, as shown in Fig. 5.9, without any bulky transformer or an extra DC-DC converter.
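A brief numerical sketch of (5.6) shows how the cascaded converter caps the back-off peak. The converter efficiency of 80% is an assumed illustrative value, as are $K=16$ and $Q_{l}=0.5$; the function name is hypothetical.

```python
import math

def eta_class_g_with_dcdc(i: int, k: int, q_loaded: float, eta_dcdc: float) -> float:
    """Eq. (5.6): Doherty-coded class-G PA whose V_DD rail comes from a DC-DC converter."""
    dcdc_term = k * (2 * k - i) / i**2 * (1.0 / eta_dcdc - 1.0)
    cs_term = (math.pi / 4.0) * (i - k) * (2 * k - i) / i**2 / q_loaded
    return 1.0 / (1.0 + dcdc_term + cs_term)

K = 16
Q_L = 0.5
ETA_DCDC = 0.8  # assumed converter efficiency, for illustration only

at_backoff = eta_class_g_with_dcdc(K, K, Q_L, ETA_DCDC)
at_peak = eta_class_g_with_dcdc(2 * K, K, Q_L, ETA_DCDC)
ideal_backoff = eta_class_g_with_dcdc(K, K, Q_L, 1.0)

print(f"6 dB back-off with converter: {at_backoff:.3f}")
print(f"peak power:                   {at_peak:.3f}")
print(f"6 dB back-off, lossless rail: {ideal_backoff:.3f}")
```

At $i=K$ the expression collapses to $\eta_{D C-D C}$ itself, and setting $\eta_{D C-D C}=1$ recovers (5.5), which matches the argument that the HoC behaves like a class-G PA with an implicit lossless conversion.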

### 5.3.3 Stacked-FET AM-AM and AM-PM Distortion

Unlike typical digital-to-analog converters (DACs) in mixed-signal applications, the SC RF-DAC [73] provides high output power levels, and hence requires large switches to achieve small equivalent on-state resistances. Unfortunately, the switches' gate and drain parasitic capacitances are linearly proportional to the switch width. Consequently, the minimum loss point between the conduction and switching components is such that the switches' on-conductance $i / K G_{o n}$ is comparable to the series susceptance of the capacitors $j \omega_{o} i / K C_{c}$ of the employed DAC. Therefore, not only the capacitor mismatch but, more importantly, the switch on-resistance mismatch affects the RF-DAC linearity. It is important to note that while it is relatively easy to match capacitor sizes to within $1 \%$ accuracy in CMOS technologies, it is hard to control the transistors' on-resistance ratios, and hence the DAC non-linearity is dominated by the switches' matching.

The effect of the on-resistance mismatch between the constituent switches on the DAC linearity, including the code-dependent AM-AM and AM-PM distortions, can be evaluated from the equivalent circuit in Fig. 5.8. As previously discussed, to realize the $i^{\text {th }}$ code amplitude, $i$ capacitors are switched between GND and $N V_{D D}$ through the equivalent output conductance $G_{\text {out }}$ of the actively switching $i$ HoC PA slices, $i / K \times G_{o n}$. On the other hand, $K-i$ capacitors are statically held down through $K-i$ NMOS switches of conductance $(K-i) / K \times G_{o n}$. Each switching PA slice comprises a pull-up PMOS path and a pull-down NMOS path with equivalent on-resistances of $R_{p}$ and $R_{n}$, respectively. Therefore, for a $50 \%$ duty cycle input, the equivalent PA slice output resistance is essentially the average of the two resistances, $\left(R_{p}+R_{n}\right) / 2$. If perfect on-resistance matching between the PMOS and NMOS switches is realized, i.e. $R_{p}=R_{n}=R$, neither AM-AM nor AM-PM distortion would result, assuming zero capacitor mismatch. Unfortunately, such matching is almost impossible to realize, which results in a voltage-division ratio mismatch between the equivalent output conductance of the actively switching PA slices, $i / K \times G_{o n}$, and that of the statically enabled NMOS pull-down switches, $(K-i) / K \times G_{o n}$. Consequently, AM-AM and AM-PM distortions that depend on the amplitude code $i$ result.

Assuming perfect matching between the constituent capacitors, the worst-case (peak) AM-AM and AM-PM distortion at each envelope code $i$ due to the on-resistance mismatch can be evaluated by assuming each PMOS switch in the PA takes on the maximum possible on-resistance deviation, e.g. $\mp 6 \sigma_{R P}$, while each NMOS switch simultaneously approaches the opposite on-resistance extreme, e.g. $\pm 6 \sigma_{R N}$, under a Gaussian distribution of the on-resistance. Thus, the upper bound on the resulting AM-AM and AM-PM distortion at each $i$ can be evaluated analytically through the output amplitude produced by the capacitive divider network in Fig. 5.8 with the new on-resistance values. In that case, the equivalent output conductance of the actively switching PA slices becomes $i / K \times G_{o n}-1 /\left(6 \sigma_{R}\right)$, assuming the PMOS and the NMOS switches in each actively switching PA slice deviate in the same direction by $6 \sigma_{R}$ for the worst-case calculation, while that of the static pull-down NMOS switches approaches $(K-i) / K \times G_{o n}+1 /\left(6 \sigma_{R}\right)$, where $\sigma_{R P}=\sigma_{R N}$ for simplicity.

Figure 5.12: Maximum distortion value due to on-resistance mismatch in a 4-bit DAC (a) with a total conductance $G_{o n}$ of $0.1 \Omega^{-1}$ and (b) with $G_{o n}$ of $1 \Omega^{-1}$. Total $C_{c}=25 \mathrm{pF}$ and $f_{o}=1 \mathrm{GHz}$.

Figure 5.12 illustrates the calculated peak value of the resulting AM-AM and AM-PM distortion at each amplitude code $i$ for a $6 \sigma_{R}$ of $25 \% R$, under different values of total PA conductance $G_{\text {on }}$. The analytical expression derived through the capacitive voltage divider in Fig. 5.8 for the output voltage amplitude provides AM-AM and AM-PM distortion values within $10 \%$ of the schematic simulation results. As shown in Fig. 5.12, the maximum AM-AM distortion occurs at $i=1$ and the peak code $i=16$, where the switch on-resistance deviation is manifested in the voltage-divider expression, in either case. When the total PA conductance $G_{o n}$ is increased 10 times, the maximum AM-AM and AM-PM distortions are reduced by 37.6 times and 2.8 times, respectively. Thus, wider PA switches not only enhance the PA drain efficiency, but also improve the PA AM-AM distortion. Since the PA exhibits a mild second-order nonlinearity in amplitude and phase, digital predistortion can be easily employed to realize higher linearity. It is important to note that if the opposite on-resistance deviation polarity is instead assumed, i.e. $i / K \times G_{o n}+1 /\left(6 \sigma_{R}\right)$ and $(K-i) / K \times G_{o n}-1 /\left(6 \sigma_{R}\right)$ in Fig. 5.8, the illustrated AM-AM and AM-PM nonlinearities in Fig. 5.12 will tilt with the opposite "negative" slope versus the amplitude code $i$.

On the other hand, $N$ times FET-stacking in the proposed PA architecture results in an $N$ times larger standard deviation $\sigma_{R}$ of the implemented switches' on-resistance under random local variations. As a result, the AM-AM and AM-PM distortions are exacerbated by more than $N$ times with device stacking, according to the model. Fortunately, the proposed PA not only employs a unary-decoded architecture but, more importantly, relies on multiple solid-state transformer ratios to realize higher amplitude resolutions, instead of increasing the conductance and capacitance segmentation of the employed DAC, which comes at an increased standard deviation in the LSB step size of the conductance or capacitance.

### 5.3.4 Recursive HoC Slice Architecture

The proposed PA architecture requires dynamic reconfiguration of individual slices between ratios to achieve Doherty-like back-off. It can be challenging to reconfigure without exceeding device ratings or wasting area, all while maintaining the same $R_{\text {out }}$ across all reconfiguration states to avoid AM-AM distortion. Perhaps the most straightforward solution would be to implement two parallel HoC amplifiers, one configured for the 1:1 ratio and the other for the 1:2 ratio, and enable one or the other to realize each ratio as in Fig. 5.10a. However, two main drawbacks come with this approach. First, the $V_{D S}$ device rating of the disabled amplifier can be exceeded by the high-voltage output amplitude coupled through the output-side capacitor, as shown in Fig. 5.13. Furthermore, the disabled switches act as diode-connected devices and hence load the peaks and valleys of the output amplitude, establishing a non-linear loading that compromises the linearity. Second, disabling one of the amplifiers wastes almost half of the silicon area.

Figure 5.13: Loading of a disabled PA cell resulting in potential device voltage rating violations.

To realize high efficiency at back-off while maintaining high linearity, a recursive reconfiguration approach is chosen in this work. Figure 5.14 illustrates the switch diagram of the implemented slice architecture used to realize the two reconfigurable transformation ratios, $1: 1$ and $1: 2$. Although only six $V_{D D}$-rated class-D PA cells are technically necessary to implement each 2-stack 2-cascade HoC slice in Fig. 5.7a, twelve cells are used to permit recursive reconfiguration with fixed $R_{\text {out }}$, without exceeding the device ratings, and without disabling or wasting silicon area. Each recursive HoC slice is implemented through two parallel 2-stack 2-cascade HoC ladders. The intermediate nodes $V_{\text {int } 1}, V_{\text {int } 2}$, and $V_{\text {int } 3}$ are tied together in the two parallel 2-stack 2-cascade HoC ladders, while the output of each of the four 2-cascade HoC PAs is coupled to $V_{\text {out }}$ via $C_{c} / 64$ capacitors.

Each 2-cascade HoC comprises six switches. Each AC switch is assigned $G_{o n} / 2$ to realize an overall $R_{\text {out }}$ of $R_{\text {on }}$. The DC switch of the two available is allocated $G_{o n} / 8$, as discussed previously. The four switches $s 3_{1}, s 3_{3}, s 2_{2}$, and $s 2_{4}$ each include an extra helper switch, sized to be $3 G_{\text {on }} / 8$, to enable a fixed $R_{\text {out }}$ value across both ratios. In the $1: 2$ transformation ratio, all the switches are operated from the input PM clocks while the helper switches are disabled. In the $1: 1$ ratio, switches $s 1_{1}$ and $s 3_{1}$ in HoC1 are statically turned on, while $s 5_{1}$ and $s 6_{1}$ are operated through the PM clock, to connect the class-D PA1 permanently between GND and $V_{\text {int } 1}$. By enabling the helper switch within $s 3_{1}$, a fixed $R_{\text {out }}$ (equal to $R_{\text {on }}$ ) can be realized across both the $1: 2$ and $1: 1$ ratios. Similarly, the switches $\left(s 1_{3}, s 3_{3}\right)$ in HoC3, $\left(s 2_{2}, s 4_{2}\right)$ in HoC2, and $\left(s 2_{4}, s 4_{4}\right)$ in HoC4 are used to permanently connect PA3, PA2, and PA4 at $\left(V_{\text {int } 2}, V_{\text {int } 3}\right),\left(V_{\text {int } 1}, V_{\text {int } 2}\right)$, and $\left(V_{\text {int } 3}, V_{i n}\right)$, respectively. This way, the four $V_{D D}$-swing PA cells PA1, PA2, PA3, and PA4 are stacked on top of each other, as desired in Fig. 5.10a, while operated from $V_{B A T}$ to enable high efficiency at 6 dB back-off.

Figure 5.14: Recursive architecture of an HoC slice. The output-side stacked two DC capacitors in Fig. 5.7b are not shown for clarity.

### 5.4 Circuit Implementation

### 5.4.1 Reconfigurable Class-D PA Cell Design

The 12 class-D PA cells used to implement an HoC slice in Fig. 5.14 are divided into three categories based on the required digital conductance programmability: nominal, segmented pull-up, and segmented pull-down. The segmented configurations include an additional pull-up/down helper switch over the nominal cell. Figure 5.15a illustrates the schematic implementation of an example segmented pull-down class-D cell. The other cell configurations can be realized in a similar way. In Fig. 5.15a, two non-overlapping clocks, $\phi_{1}$ and $\phi_{2}$, are generated from the received level-shifted PM signal through three-transistor inverters [92] with feedback from the opposite phase to realize minimal dead-time and eliminate any shoot-through current. Clocks $\phi_{1}$ and $\phi_{2}$ are provided through a cascaded chain of buffers to drive the gate capacitance of the NMOS $M_{n}$ and PMOS $M_{p}$ switches.

Figure 5.15: Reconfigurable class-D PA generic cell schematic (a). Placing each PA cell in a separate deep n-well (b). Floating connection enables $2 \times$ reduction in bottom-parasitics unlike the required high bias resistance used in [1].

The PA conduction (RMS) loss stems from the load current flowing through the switches' on-resistance, and hence through the PA equivalent $R_{\text {out }}$. The second key loss component of the PA originates from the charging and discharging, once per RF cycle, of the parasitic capacitances of the constituent power switches, which include the gate, drain, and body parasitics, and of the capacitors' top- and bottom-plate parasitics. Therefore, the total PA loss is set by:

$$
\begin{equation*}
P_{\text {loss }}=P_{R M S}+P_{\text {switching-transistor }}+P_{\text {switching-cap }} \tag{5.7}
\end{equation*}
$$

In order to realize the maximum possible PA power-added efficiency at a given carrier frequency $f_{o}$, input voltage $V_{i n}$, optimum resistance $R_{L}$, and for a given technology, the total PA loss $P_{\text {loss }}$ must be minimized. Figure 5.16a shows the optimization plots of the simulated PA conduction and switching loss components associated with the PA switches, including the capacitors' parasitic switching loss, versus the switch size in low-power 65 nm CMOS. While a wider switch results in a smaller conduction loss $P_{R M S}$, it comes at higher parasitic switching losses, and hence an optimal switch width can be found where both loss components are matched (Fig. 5.16a). However, as shown in Fig. 5.16b, the selected switch size for this design is almost 1.8 times the optimal point size. This is in order to realize an overall $R_{\text {out }}$ of $1 \Omega$ for a single-ended amplifier at $65^{\circ} \mathrm{C}$ to enable the PA to deliver above 23 dBm of total $P_{\text {out }}$ to the load $R_{L} \approx 10 \Omega$ (after a small amount of impedance transformation per Fig. 5.7a) and to achieve superior linearity, given the dominance of the on-resistance mismatch on the PA distortion, as previously discussed. Thus, the selected switch sizes are $16 \mu \mathrm{~m}$ for NMOS and $41.6 \mu \mathrm{~m}$ for PMOS in an AC class-D cell with matched NMOS and PMOS on-resistance. Figure 5.16c illustrates the schematic-simulated peak-amplitude PAE under the optimal switch sizing, with matched conduction and switching losses, versus $f_{o}$ for an optimum $R_{L}$ of $10 \Omega$. As shown, using the employed low-power 65 nm transistors of increased threshold voltage (and hence higher on-resistance) for low leakage, the peak PAE degrades with higher $f_{o}$ and reaches $44 \%$ at $f_{o}=3 \mathrm{GHz}$ in schematic simulations. This suggests about $24 \%$ after accounting for layout parasitics, the on/off-chip ESR, and the losses in the load transformation network as well as the employed balun, including the amplitude/phase imbalance effects. Furthermore, the PA peak $P_{\text {out }}$ gets lower as $f_{o}$ is increased, since the optimal switch size in Fig. 5.16c is reduced to realize lower parasitic switching losses.
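The trade-off described above can be sketched with a first-order model in which the conduction loss scales as $1 / W$ and the parasitic switching loss scales as $W$. This scaling and the coefficients `a_cond` and `b_sw` are assumptions made for illustration; they are not the simulated data of Fig. 5.16.

```python
import math

def total_loss(width: float, a_cond: float, b_sw: float) -> float:
    """First-order model: conduction loss a/W plus switching loss b*W."""
    return a_cond / width + b_sw * width

def optimal_width(a_cond: float, b_sw: float) -> float:
    """Width minimizing a/W + b*W; both loss terms are equal there."""
    return math.sqrt(a_cond / b_sw)

def loss_penalty(oversize: float) -> float:
    """Total loss at oversize * W_opt, relative to the minimum loss."""
    return 0.5 * (oversize + 1.0 / oversize)

# Hypothetical coefficients in arbitrary units.
A_COND, B_SW = 1.0, 4.0
w_opt = optimal_width(A_COND, B_SW)

print(f"optimal width:              {w_opt:.3f}")
print(f"loss at optimum:            {total_loss(w_opt, A_COND, B_SW):.3f}")
print(f"loss penalty at 1.8x width: {loss_penalty(1.8):.3f}")
```

Under this simplified model, a switch 1.8 times wider than the optimum lowers the on-resistance by the same factor while raising the total switch-related loss by less than 20%, which is consistent with trading a modest efficiency penalty for output power and linearity.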

The power switches charge and discharge the top plate of each cell's coupling capacitor, $C_{c}$, implemented between M8/M7 with a $2 \mathrm{fF} / \mu \mathrm{m}^{2}$ MIM capacitor. This way, the bottom-plate capacitance is tuned out through the inductive bandpass filter, rather than being hard charged and discharged through the PA cell. MIM capacitors, instead of the denser MOS capacitors, are used for $C_{c}$ for their higher precision/linearity and, importantly, their high voltage rating of 10 V. The helper transistor, $M_{h}$, is operated as a static switch through the transformation ratio control bit $T X$, level-shifted to the corresponding stacked domain.

With process scaling, the substrate/drain diode's breakdown voltage gets lower. Stacking NMOS devices with their body tied to the p-substrate would cause the topmost NMOS in the stack to block a large drain-to-body voltage (e.g., $V_{D B}$ of 4.8 V), exceeding the breakdown voltage in deep-submicron CMOS. The implemented class-D cell is instead isolated in a separate deep n-well (DNW), as in Fig. 5.15b, so that the substrate p/DNW diode (which has a breakdown voltage on the order of 12 V) blocks the large output voltage instead of the NMOS substrate $\mathrm{p} / \mathrm{n}+$ diode. This enables stacking up to ten class-D PA cells for a 12 V maximum output voltage swing. Furthermore, this ensures a fixed threshold voltage $V_{t h}$ of the used switches during operation, which ensures constant conductance and minimum distortion. To reduce switching losses, the DNW is left floating while the inner p-well is shorted to its respective flying ground to prevent latch-up. As shown in Fig. 5.15b, this effectively places the parasitic capacitors of the p/DNW and the p-well/DNW diodes in series, reducing the bottom-plate well parasitics by a factor of $\sim 2$. The cell clamping capacitance ( $C$ in Fig. 5.14) is implemented in the same DNW using a thin-oxide PMOS transistor with $12.5 \mathrm{fF} / \mu \mathrm{m}^{2}$ of density and a breakdown voltage of 1.5 V. The non-linearity of such capacitance does not affect the topologically-defined steady-state DC voltage across each 1.2 V cell. Each cell comprises $0.6 \mathrm{pF}$ ( $\sim 7.8 \Omega$ ESR at $65^{\circ} \mathrm{C}$ ) of clamping capacitance to enable automatic voltage balancing against non-fully differential signals and to provide proper decoupling for the gate drivers of the power switches. The total required clamping capacitance is 230.4 pF in the present differential implementation.

Figure 5.16: Simulated overall loss optimization plots. (a) Conduction and switching loss components at $f_{o}=720 \mathrm{MHz}$. (b) Peak-amplitude PAE and $R_{\text {out }}$ versus switch size at $f_{o}=720 \mathrm{MHz}$. (c) Optimal peak-amplitude PAE versus $f_{o}$.
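The capacitance bookkeeping in this paragraph can be reproduced with a few lines. The sketch assumes the total is simply the per-cell value multiplied by 12 cells per slice, 16 slices, and two differential halves, and it models the two well junction capacitances as ideal series capacitors of equal value; both are simplifying assumptions.

```python
CELL_CLAMP_PF = 0.6        # clamping capacitance per class-D cell
CELLS_PER_SLICE = 12       # cells in one recursive HoC slice
SLICES = 16                # unary slices per single-ended PA
DIFFERENTIAL_HALVES = 2    # differential implementation

total_clamp_pf = CELL_CLAMP_PF * CELLS_PER_SLICE * SLICES * DIFFERENTIAL_HALVES

def series_capacitance(c1: float, c2: float) -> float:
    """Equivalent capacitance of two capacitors in series."""
    return c1 * c2 / (c1 + c2)

# With a floating DNW the two junction capacitances appear in series.
# Equal junction capacitances (normalized to 1.0) are assumed here.
well_parasitic_ratio = series_capacitance(1.0, 1.0) / 1.0

print(f"total clamping capacitance: {total_clamp_pf:.1f} pF")
print(f"floating-DNW parasitic relative to a single junction: {well_parasitic_ratio:.2f}")
```

The product matches the quoted 230.4 pF, and equal junction capacitances give the factor-of-two reduction; unequal junctions would give a somewhat different reduction.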

### 5.4.2 Interfacing Level Shifters

Besides the $T X$ signal that controls the helper switch in a static manner, an extra bit $E N$ is employed to clock-gate (i.e. PM-gate) the whole HoC slice to statically hold the slice coupling capacitor low. Therefore, each recursive HoC slice in Fig. 5.14 receives two gain-setting bits ($E N$, $T X$) to establish three gain states: statically holding $C_{c}$ down $(0,0)$, switching with $V_{D D}$ swing $(1,0)$, and switching with $2 V_{D D}$ swing $(1,1)$. This requires shifting the voltage levels of the input PM clock and of the helper-switch enable signal $T X$ to the appropriate levels needed by all twelve PA cells. Figure 5.17a shows the proposed Dickson shifter used to achieve that. A fork-based clock tree is established through the depicted Dickson capacitor connection ( $C_{s h}=35 \mathrm{fF}$ using MIM) to distribute balanced in-phase PM signals to the initial four stacked domains in each HoC ladder.
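The mapping from an envelope code to the per-slice ($E N$, $T X$) pairs under the Doherty-like coding can be sketched as follows. The function operates on the amplitude index $i$ with $0 \leq i \leq 2 K$ rather than on the raw bits $A[4: 0]$, and the ordering of slices within the returned list is an arbitrary assumption; only the counts per state follow from the text.

```python
def slice_states(code: int, k: int = 16) -> list:
    """Return k (EN, TX) pairs for amplitude index `code`, 0 <= code <= 2k."""
    if not 0 <= code <= 2 * k:
        raise ValueError("code must lie in [0, 2k]")
    if code <= k:
        n_high, n_low = 0, code               # 1:1 ratio: V_DD swing only
    else:
        n_high, n_low = code - k, 2 * k - code  # peaking and main slices
    states = [(1, 1)] * n_high + [(1, 0)] * n_low
    states += [(0, 0)] * (k - len(states))    # remaining slices held low
    return states

def normalized_amplitude(states: list) -> int:
    """Sum of slice swings in units of V_DD (to be divided by K)."""
    swing = {(0, 0): 0, (1, 0): 1, (1, 1): 2}
    return sum(swing[s] for s in states)

for code in (0, 8, 16, 24, 32):
    print(code, normalized_amplitude(slice_states(code)))
```

For every index the summed swing equals the index itself, so the output amplitude stays proportional to $i / K \times V_{D D}$ across both ratios, in agreement with (5.1) applied to a $V_{D D}$ swing and with (5.4).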


Figure 5.17: Proposed balanced Dickson shifter (a). Generating the PM clocks of PA1, PA2, PA3, and PA4 of the recursive slice in Fig. 5.14 (b).

Unlike a conventional ladder shifter, which can have large skew (40 ps in simulation) due to the series connection of the capacitors and thereby the unequal reactances connecting the input clock to the inputs of the stacked domains, the proposed approach achieves low skew and requires $3 \times$ less capacitance. A static latch is used to provide a low-impedance path to balance the voltage across the Dickson capacitors and enable robust operation against leakage or any coupled glitches. A $1 / 2$-sized inverter is used in the level shifter to establish a weak feedback in the latch that is easily overridden by the triggering input PM driver, thus reducing the required capacitance. The low, $V_{L}$, and the high, $V_{H}$, supplies of each latch are provided through two consecutive voltage levels from the following list: GND, $V_{\text {int } 1}, V_{\text {int } 2}, V_{\text {int } 3}, V_{\text {in }}$, as illustrated in Fig. 5.17a.

The helper enable signal, $T X$, can be shifted in a similar manner for each of the four switches $s 2_{2}, s 3_{1}, s 2_{4}$, and $s 3_{3}$, where the Dickson shifter operates at the envelope sample rate. The PM input of the flying cells PA1 and PA3, in Fig. 5.14, is provided through CMOS OR gates between $\left(\mathrm{GND}, V_{\text {int } 1}\right)$ and $\left(V_{\text {int } 2}, V_{\text {int } 3}\right)$, while the inputs to PA2 and PA4 are supplied through AND gates between $\left(V_{\text {int } 1}, V_{\text {int } 2}\right)$ and $\left(V_{\text {int } 3}, 4 V_{D D}\right)$, as shown in Fig. 5.17b. When $T X=1$, i.e. the $1: 2$ ratio, the gate terminals of (PA1, PA2) and (PA3, PA4) are statically connected to $V_{\text {int } 1}$ and $V_{\text {int } 3}$, respectively. In the $1: 1$ ratio, the initial four stacked PA cells in the odd ladder are statically enabled, connecting PA1 and PA3 between (GND, $V_{\text {int } 1}$ ) and ( $V_{\text {int } 2}$, $V_{\text {int } 3}$ ), respectively, while the PM signals are allowed through the OR gates to the gates of PA1 and PA3. A similar operation follows for the even ladder. When the recursive HoC slice is deactivated through the $E N$ signal received from the thermometer decoder in Fig. 5.7a, the input PM clock is gated, enabling all the NMOS switches and statically holding the output $C_{c} / 64$ capacitors low. When reconfiguring between any two of the three states, the lead delay should be balanced by ensuring equal logic depth for the clock propagation in the $1: 1$ and $1: 2$ cases, to eliminate any AM-PM distortion.

### 5.5 Experimental Results

The proposed recursive House of Cards power amplifier is implemented in a low-power (LP) 65 nm bulk CMOS process with nine metal layers.${ }^{2}$ A die photo is shown in Fig. 5.18, with the constituent 16 recursive HoC slices annotated as well as the differential three-level H-tree for balanced PM signal distribution. The occupied area is $1.2 \mathrm{~mm} \times 1 \mathrm{~mm}$. However, the design is loosely wired, and the combined area of the individual blocks needed to realize 5-bit amplitude resolution is approximately $0.83 \mathrm{~mm} \times 0.58 \mathrm{~mm}$. Higher resolutions can be achieved by further segmenting the same total conductance and capacitance resources to realize finer steps. The chip is directly mounted onto a Rogers 4003C PCB with $50 \Omega$ transmission lines for the input and output terminals. All PA cells are implemented with thin-oxide 1.2 V transistors, and yet, thanks to the novel stacking and cascading HoC structure, the PA is directly connected to a 4.8 V supply without violating any transistor voltage ratings.

[^3]

Figure 5.18: Chip micrograph.


Figure 5.19: Measurement setup.

The PA testing setup is shown in Fig. 5.19. A vector signal generator (Keysight N5182B) generates constant-envelope phase-modulated RF waveforms up to 1 GHz , while an FPGA (Xilinx Spartan 6) generates 32 bits of digital amplitude data (two bits to set the state of each of the 16 HoC slices) with up to 144 MHz of bandwidth. The differential PA outputs are then combined via an off-chip balun and measured by a spectrum analyzer (Keysight N9020A). The large digital bus from the FPGA to the chip has up to 5 ns of within-bus timing misalignment due to trace length mismatch inside the FPGA and the PCB. This limits the close-in shoulder height of the resulting spectrum to about 50 dBc , theoretically. The AM and PM generated signals are frequency synchronized through a 10 MHz reference signal, while the PATT trigger signal from the VSG aligns the frame start on the FPGA with a resolution of one sample.

When generating non-modulated (CW) signals, the PA was measured to generate up to 23 dBm of peak power at the $1: 2$ ratio, while achieving a $40.3 \%$ battery-to-RF efficiency, as


Figure 5.20: (a) Measured battery-to- $P_{\text {out }} \mathrm{PAE}$, output power $\left(P_{\text {out }}\right)$, and output voltage amplitude versus the input code of the proposed flying-domain PA with $50 \Omega$ antenna ( $f_{o}=720 \mathrm{MHz}$ ). (b) Measured DNL and INL of the proposed PA.
shown in Fig. 5.20a. The accuracy of the off-chip matching components results in $\sim 90 \mathrm{mV}$ voltage amplitude imbalance between the differential PA channels, which, in addition to the amplitude and phase imbalance of the employed balun, serves to degrade the efficiency by 4-6\% due to the non-differential DC-DC loading. A fully integrated matching can help mitigate the amplitude and phase imbalance issues. At the $1: 1$ ratio ( 6 dB backoff), the PA achieves $40.8 \%$ efficiency demonstrating the elimination of cascaded DC-DC losses from the power flow. This is in fact higher than the efficiency at the peak power due in part to the linear and quadratic scaling of the gate-drive and DNW bottom-plate parasitics of the flying domains, respectively. Thanks to the magnetic-less Doherty-like structure, the PA achieves a nearly flat backoff between the two ratios. This is unlike conventional digital Doherty implementations with high-order transformer magnetics [94], [95], [96] that suffer from $8.2 \%, 4.9 \%$, and $11 \%$ lower relative efficiency at 6 dB back-off, and that are powered from ideal low supply voltages of $3 \mathrm{~V}, 1.5 \mathrm{~V}$, and $2.4 \mathrm{~V} / 1.2 \mathrm{~V}$, respectively. Compared to an ideal class-B PA powered by an $80 \%$ efficiency DC-DC converter, the proposed PA achieves $8.3 \%$ and $24.8 \%$ higher efficiency at peak power and 6 dB backoff, respectively. As shown, the measured efficiency is in good agreement with the analytical model. Thanks to the topologically-defined, KVL constrained circuit and the unary-sized array, the proposed PA achieves excellent static linearity results: less than 0.05 LSB differential non-linearity (DNL), and less than 0.5 LSB integral non-linearity (INL), as shown in


Figure 5.21: Measured dynamic characteristics of the 16-QAM signal.

Fig. 5.20b.
Figure 5.21a shows the results of dynamic tests, in which a 10 MHz 16-QAM signal was fed into the PA at a 72 MHz envelope rate. This requires strict timing alignment between the phase and amplitude paths, ideally with sub-ns resolution. The employed measurement setup afforded an alignment accuracy of only $\sim 13 \mathrm{~ns}$, which, while not ideal, was sufficient, given the excellent linearity of the proposed circuit, to achieve an error vector magnitude (EVM) of 3.6\%-rms, as shown in Fig. 5.21a with the constellation diagram (bottom right) and the in-band power spectrum (left), all without employing any digital pre-distortion. Figure 5.22 shows the measured transmitted power spectrum of the 10 MHz 16-QAM signal. While the PA achieves $-31.7 \mathrm{dBc}$, the close-in shoulder height in Fig. 5.22a can be further improved with better time alignment between the amplitude code and the PM signal. The PA achieves an average PAE of $26.5 \%$ at an average $P_{\text {out }}$ of 15.7 dBm. The aliased artifacts in Fig. 5.22b are caused by sampling the amplitude at 72 MHz, and can be reduced further by increasing the sampling frequency and/or using a higher-order filtering function. A first- or second-order hold digital filter can be employed to reconstruct the continuous-time signal from the discrete samples through linear (or higher-order) interpolation instead of holding each sample value for one sample interval (i.e., a zero-order hold).
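The difference between zero-order and first-order hold reconstruction can be illustrated with a minimal Python sketch. The function names, the upsampling factor, and the sample values are hypothetical and are not taken from the measurement setup; the sketch only contrasts holding each amplitude sample with linearly interpolating between consecutive samples.

```python
def zero_order_hold(samples, factor):
    """Repeat every sample `factor` times (staircase reconstruction)."""
    return [s for s in samples for _ in range(factor)]


def first_order_hold(samples, factor):
    """Insert linearly interpolated points between consecutive samples."""
    out = []
    for a, b in zip(samples, samples[1:]):
        for k in range(factor):
            out.append(a + (b - a) * k / factor)
    out.append(samples[-1])  # keep the final sample itself
    return out


if __name__ == "__main__":
    amplitude_codes = [0, 4, 2]  # hypothetical AM codewords, not measured data
    print(zero_order_hold(amplitude_codes, 2))
    print(first_order_hold(amplitude_codes, 2))
```

The staircase produced by the zero-order hold contains abrupt jumps at the sampling rate, whereas the interpolated sequence changes by smaller increments, which is the intuition behind the reduced spectral images.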

A transient test of a 20 MHz 32-QAM modulated signal performed using a previous


Figure 5.22: Measured spectrum, close-in (a) and far-out (b).
version of the developed test setup is shown in Fig. 5.23. At a 100 MHz envelope rate, the PA responds to 3-bit AM codeword changes (observed to be the largest change in the signal) within 2.5 ns, as shown in Fig. 5.23 (bottom), implying that the maximum bandwidth of the proposed design is up to 400 MHz. Generally, the implementation of the amplitude and phase modulators on-chip at high data rates is challenging and results in large area and power overheads. Table 5.1 summarizes the results of the proposed PA in contrast to prior art. The recursive House-of-Cards PA achieves the highest power-added efficiency amongst prior-art battery-connected CMOS PAs at both peak and 6 dB backoff power levels.


Figure 5.23: Measured time-domain output of the proposed PA with 32-QAM 20-MHz OFDM ($f_{o}=720 \mathrm{MHz}$) (top) and measured AM step response for 6-step change (bottom).
Table 5.1: Comparison with prior work

|  | Aoki, JSSC'02 | Nakatani, CSICS'13 | Yoo, ISSCC'11 | Arno, ISSCC'14 | Hu, RFIC'14 | Lee, ISSCC'15 | This work |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| PA technique | Power combining | Digital polar | Switched capacitor | Env. tracking | Digital Doherty | Env. tracking | Power Inverter |
| Technology | 180 nm | 150 nm | 90 nm | 130 nm | 65 nm | 130 nm | 65 nm LP |
| Frequency [GHz] | 1.9 | 0.85 / 1.75 | 2.2 | 1.747 | 3.6 | 1.747 | 0.72 |
| VBAT [V] | N/A | 3.3 | N/A | 3.7 | N/A | 4 | 4.8 |
| PA VDD [V] | 1.8 | N.R. | 1.25 / 2.5 | N.R. | 3 | N.R. | N/A |
| Pout,max [dBm] | 34.5 | 27.5 / 29 | 25.2 | 26.3 | 27.3 | 26 | 23 |
| PAE @ Pout,max [\%] | 50 | N.R. | 55.2 | N.R. | 32.5 | 40 | 40.3 |
| PAE @ 6 dB backoff [\%] | 27* | N.R. | 35.1 | N.R. | 22 | 28 | 40.8 |
| VBAT-to-Pout @ Pout,max [\%] | N/A | 13.2 / 22.2 | N/A | 39 | N/A | N/A | 40.3 |
| VBAT-to-Pout @ 6 dB backoff [\%] | N/A | N.R. / 11.1* | N/A | N.R. | N/A | N/A | 40.8 |
| INL | N/A | N/A | $<3$ LSB | N/A | N/A | N/A | $<0.5$ LSB |
| DNL | N/A | N/A | $<0.5$ LSB | N/A | N/A | N/A | $<0.09$ LSB |

### 5.6 Conclusion

This chapter has presented a new PA design that utilized stacked and flying class-D cells arranged in a House of Cards architecture to facilitate efficient generation of high output power, all while using low-voltage, thin-oxide CMOS transistors. Individual HoC cells were then made fully modular and reconfigurable to support different voltage conversion ratios and thus high efficiency at 6 dB back-off. By capacitively combining the outputs of HoC slices operating at different ratios, the PA could be dynamically configured to deliver high efficiency at intermediate back-off levels, exactly following a Doherty back-off profile, but without requiring any magnetic components. The proposed HoC PA was implemented in 65 nm LP CMOS, was operated directly at 4.8 V without any explicit DC-DC converter, and was shown to achieve $>40 \%$ battery-to-RF efficiency at both peak power and 6 dB back-off, while enabling linear transmission of $>10 \mathrm{MHz}$ 16-QAM signals.

### 5.7 Acknowledgements

This chapter is based on and mostly a reprint of the following publications:

- L.G. Salem, J.F. Buckwalter, and P.P. Mercier, "A recursive switched-capacitor house-of-cards power amplifier," IEEE Journal of Solid-State Circuits (JSSC), Jul. 2017, vol. 52, no.7, pp. 1719-1738.
- L.G. Salem, J.F. Buckwalter, and P.P. Mercier, "A recursive house-of-cards digital power amplifier employing a $\lambda / 4$-less Doherty power combiner in 65 nm CMOS," in Proc. IEEE European Solid-State Circuits Conference (ESSCIRC), Sep. 2016, pp. 189-192.


## Chapter 6

## Adiabatic Clocking

### 6.1 Introduction

Clock distribution in modern SoCs consumes a significant fraction of total chip power. To reduce clock distribution power, resonant clocking schemes, where an inductive reactance is used to cancel the capacitive reactance of global clock networks at a given resonance frequency, $f_{o}$, have been proposed. Conventionally, such schemes are only suitable at high multi-GHz frequencies in order to be able to place the employed inductors on chip [97, 98]. Since many modern energy-efficient SoC designs optimize for clock frequencies $<2 \mathrm{GHz}$, with dynamic voltage and frequency scaling (DVFS) techniques bringing the core clock frequencies and the supply voltages $V_{D D}$ to the MHz and near-threshold regimes, respectively, there is a need to develop low-power clock distribution schemes that can work across increasingly wider operating ranges. While quasi-continuous resonant clocking schemes have recently been proposed to intermittently cancel global clock tree capacitance during edge transitions, such techniques require large off-chip inductors and are limited to 0.98 MHz [99] and 150 MHz [100], respectively, owing to the need to operate well below resonance (i.e., $\ll f_{o} / 10$). Thus, while prior art has shown power reduction for targeted applications, all such schemes require large on- or off-chip magnetics and do not meet the MHz-to-GHz frequency-range needs of modern DVFS-enabled SoCs. To address these problems, this chapter introduces a fully-integrated adiabatic clocking scheme that efficiently synthesizes $n$-step clock waveforms from 1 MHz to 2 GHz via a switched-capacitor DC-AC multi-level inverter topology, theoretically reducing clock power by a factor of $n$ without using any magnetic component.

### 6.2 Challenges of Resonant Clocking

Fig. 6.1 illustrates prior resonant clocking techniques, including intermittent resonant clocking (IRC) [3] and quasi-resonant clocking (QRC) [4] schemes. Conventional approaches utilize an array of on-chip inductors along with a per-inductor decoupling capacitor ($>10 \times C_{C L K}$). Unfortunately, clock power increases when operating $\sim \pm 20 \%$ away from resonance ($f_{o}$), thereby limiting DVFS opportunities. On the other hand, IRC and QRC techniques can enable DVFS up to $\sim f_{o} / 10$ by employing large off-chip inductors, as in Fig. 6.1 (right). However, such approaches can suffer from severe ringing if accurate pulse-width timing is not ensured, thereby requiring power-expensive timing logic overhead, e.g., delay-locked loops (DLLs). Furthermore, special gate drivers or charge pumps are required to either boost the gate-drive voltage of the footer NMOS in IRC techniques, or provide a $V_{D D} / 2$ gate drive for the QRC footer transistor Mf, to ensure that it turns off before its drain voltage goes to $V_{D D} / 2$ (which is a further device reliability issue).

### 6.3 Adiabatic Switched-Capacitor Driver

In contrast, clock power is reduced in the proposed approach through an adiabatic stepwise charging technique implemented using a 4-level switched-capacitor DC-AC inverter topology, shown in Fig. 6.1. In this scheme, the CLK capacitance ($C_{C L K}$) is step-wise charged by sequentially turning on switches s1, s2, s3, and s4, which creates a 4-level voltage staircase whose levels are set by GND, self-balanced capacitors C1 and C2, and $V_{D D}$. Afterwards, CLK is brought down


Figure 6.1: Prior inductive resonant clocking techniques (top); the proposed switched-capacitor multi-level adiabatic clocking technique (bottom).
to GND in the reverse order. Theoretically, 4-level adiabatic charging reduces CLK power by $3 \times$. By repeating the same operation periodically at $f_{C L K}$, a KVL-constrained multi-phase switched network is established which inherently enforces $V_{L 2}=V_{D D} / 3$ and $V_{L 3}=2 / 3 V_{D D}$ without any explicit DC-DC converter.
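The theoretical $3 \times$ figure follows from the standard stepwise-charging energy argument, which can be sketched in a few lines of Python. This is an idealized model under stated assumptions, not the measured behavior: every step is assumed to settle fully, and the intermediate levels are treated as ideal, equally spaced voltage sources standing in for the self-balanced capacitors. The function names are illustrative.

```python
def stepwise_cycle_energy(c_clk, vdd, levels):
    """Energy (J) dissipated per clock cycle for an ideal `levels`-level staircase.

    Assumes full settling at every step and ideal, equally spaced levels.
    """
    steps = levels - 1                # e.g., 4 levels -> 3 voltage steps
    dv = vdd / steps                  # voltage change per step
    per_step = 0.5 * c_clk * dv ** 2  # loss when moving C by dv through a switch
    return 2 * steps * per_step       # charge-up plus discharge


def ideal_savings(levels):
    """Fractional savings relative to a conventional 2-level driver."""
    reference = stepwise_cycle_energy(1.0, 1.0, 2)
    return 1.0 - stepwise_cycle_energy(1.0, 1.0, levels) / reference


if __name__ == "__main__":
    c_clk, vdd = 15e-12, 1.0  # 15 pF load at 1 V, as quoted in this chapter
    for levels in (2, 3, 4):
        energy = stepwise_cycle_energy(c_clk, vdd, levels)
        print(levels, energy, ideal_savings(levels))
```

In this idealized model the 3- and 4-level modes save 50% and about 66.7% of the conventional $C_{C L K} V_{D D}^{2}$ energy per cycle, which bounds what any implementation with driver and timing overhead can achieve.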

### 6.4 Circuit Implementation

The proposed reconfigurable 4-level inverter, shown in Fig. 6.2, is composed of two standard CMOS inverters whose outputs are tied together: an outer inverter powered between $V_{D D}$ and GND, and an inner inverter with a floating supply and ground at $2 / 3 V_{D D}$ and $1 / 3 V_{D D}$, respectively. The outer inverter is controlled by signals P and N, periodically connecting its output to $V_{D D}$ or GND, while the inner inverter is controlled by signals Pi and Ni, periodically connecting its output to $2 / 3 V_{D D}$ or $V_{D D} / 3$. These control signals are generated by passing the input clock through a tunable-delay chain of inverters, producing three signals, A, B, and C, with equal delay times, $\Delta t$, and passing these signals through a custom House-of-Cards (HoC) timing gate (whose operation is logically represented by eight combinational gates in Fig. 6.2). To enable adiabatic charging, the switch on-resistance/width, $R_{o n} / W_{s w}$, should be set such that the $R C_{C L K}$ time constant, $\tau$, is less than $\Delta t / 1.4$. The 4-level inverter requires a total of $6.7 C_{C L K}$ of self-balancing capacitance, which is $1.8 \times$ lower than the capacitance required in conventional resonant schemes. The 4-level inverter can be reconfigured into a 3-level inverter by overlapping pulses Ni and Pi, as shown in Fig. 6.2, coarsely decreasing the $10-90 \%$ rise/fall time from $0.8 \times 3 \Delta t$ to $0.8 \times 2 \Delta t$; fine rise/fall time configuration can be adjusted via the tunable delay chain. The 4-level inverter can also be reconfigured into a standard 2-level CMOS inverter by disabling the inner inverter.

The actual implementation of the custom HoC timing gate, optimized to generate N, Ni, Pi, and P with non-overlapping properties in minimal area and power, is shown in Fig. 6.3. Non-overlapping pulses are inherently generated in the HoC gate since, when the leaves of the HoC tree turn on, the output pulses must wait until the common root in the tree is charged or discharged. For instance, suppose that $\mathrm{ABC}=110$, so that $\overline{\mathrm{Ni}}=0$. Then, if C transitions from 0 to 1, Cp1 and Cp2 (Cp3 and Cp4) are already discharged (charged) when the $\mathrm{C} / \bar{C}$ edge arrives, and hence all controlling pulses (N, Ni, Pi, P) are synchronized without overlap. The HoC gate can be folded to support 4-, 3-, or 2-level timing signals via configuration bits R1 and R0.

The overall architecture of the adiabatic clocking prototype is shown in Fig. 6.3. A 4b programmable-strength reconfigurable 4-level inverter is implemented, where all 16 slices share the same VL2 and VL3 nodes, each connected to 50 pF of on-chip thick-oxide capacitance. An on-chip current-starved oscillator locked through an off-chip phase-locked loop (PLL) is employed


Figure 6.2: Circuit schematics and timing diagrams of the 4-level inverter, showing how it can be reconfigured into 3- and 2-level modes.
as the clock source. To ensure sufficient rise/fall time for adiabatic operation up to 2 GHz, phases A, B, and C are provided from the first 3 stages in the 5-stage ring oscillator such that the adiabatic CLK $10-90 \%$ ($20-80 \%$) rise/fall time is $\ll 24 \%$ ($\ll 18 \%$) of the CLK period. The 4-level inverter drives a $32 \times$ pipelined array of 64b MACs. Capacitance from digital logic, CLK wiring, and drain parasitics of the driver totals $C_{C L K}=15 \mathrm{pF}$ (2:1:1).
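As a rough consistency check of the rise/fall-time fractions quoted above, assume (an idealization that is not stated explicitly in the text) that every stage of the 5-stage ring oscillator contributes the same delay $\Delta t$, so that the oscillation period is twice the total loop delay, and that the 4-level staircase approximates a linear ramp spanning $3 \Delta t$:

$$
T_{C L K}=2 \times 5 \times \Delta t=10 \Delta t, \qquad \frac{0.8 \times 3 \Delta t}{10 \Delta t}=24 \%, \qquad \frac{0.6 \times 3 \Delta t}{10 \Delta t}=18 \% .
$$

Under these assumptions the transition times are fixed fractions of the period, independent of the oscillation frequency, which is consistent with the $24 \%$ and $18 \%$ bounds.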

Fabricated in 9M 45nm silicon-on-insulator (SOI), the designed global clock distribution, spanning a load area of $550 \times 550 \mu \mathrm{~m}^{2}$, takes the form of a tree-driven grid. The clock tree and grid (as well as the power distribution) occupy the top 2 ultra-thick (UT) metals M9 and M8, respectively. Each line of the 5-level H-tree is split into multiple fingers as shown in Fig. 6.3 to reduce inductance and enable rigid operation up to 10 GHz . The adiabatic driver, including


Figure 6.3: Schematic of the custom HoC timing gate (top); architecture of the implemented test chip (bottom).
self-balancing capacitors, occupies only $0.0187 \mathrm{~mm}^{2}(<6.2 \%$ of the load area). To quantify the improvement over conventional clocking, the driver is configured into the 2-level mode with reduced drive strength for identical rise/fall time to the 3/4-level modes, while multi-level overhead circuits are off.

### 6.5 Measurement Results

Measurement results at 1 V in Fig. 6.4 indicate 4-level (3-level) clock power savings of at least $42 \%$ ($28.4 \%$) from 10 MHz to 2 GHz while successfully operating a digital load, with $55.6 \%$ ($45.5 \%$) peak savings at 10 MHz, where adiabatic clocking overhead is minimal. At 0.4 V near-threshold operation, 4-level (3-level) clocking achieves a measured power savings of at least $34.4 \%$ ($22.5 \%$) from 1 MHz to 267 MHz. Figure 6.4 also shows the measured CLK driver energy under DVFS operation between 0.4 V and 1 V, showing above $39.4 \%$ savings across the entire DVFS range, with $46.5 \%$ peak savings. Figure 6.5 shows the measured power savings across all possible voltages and frequencies, indicating a $41.8 \%$ average savings across a $2000 \times$ dynamic frequency range. The measured transient waveforms of 4-level operation at 10 MHz from a 1 V supply are shown in Fig. 6.5, observed via both a common-source PMOS analog buffer (open-drain driver) biased by $25 \Omega$ ($50 \Omega$ on the PCB in parallel with the $50 \Omega$ input of a sampling scope) for $0.75 \mathrm{~V} / \mathrm{V}$ gain, and a cascaded inverter chain. Figure 6.6 compares the proposed design to state-of-the-art clocking schemes, demonstrating the widest adiabatic frequency and supply voltage dynamic ranges with the highest clock power savings, all with minimal overhead. A die photo is shown in Fig. 6.7.

### 6.6 Acknowledgements

This chapter is based on and mostly a reprint of the following publication:
L.G. Salem and P.P. Mercier, "A 0.4-1V 1MHz-to-2GHz switched-capacitor adiabatic clock driver achieving $55.6 \%$ clock power reduction," 2017 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, Feb. 2017, pp. 442-443.


Figure 6.4: Measured clock power improvement of 4- and 3-level clocking compared to conventional clocking at 1 V and 0.4 V across frequency (left); measured CLK energy-per-bit improvement of the 4-level inverter across supply voltages.


Figure 6.5: Measured power savings through 4-level clocking from $1 \mathrm{MHz}-2 \mathrm{GHz}$ and $0.4-1 \mathrm{~V}$, achieving an average savings of $41.8 \%$ across the entire space (top); measured transient waveforms of the 4-level adiabatic clock at ($10 \mathrm{MHz}, 1 \mathrm{~V}$) (bottom).

|  | ISSCC'04 [1] | ISSCC'13 [3] | ISSCC'14 [2] | ISSCC'16 [4] | This Work |
| :---: | :---: | :---: | :---: | :---: | :---: |
| Process | 90 nm | 40 nm | 22 nm SOI | 65 nm | 45 nm SOI |
| Power reduction methodology | LC resonance | LC resonance | LC resonance | LC resonance | Step-wise charging |
| Inductor | Four 1 nH per CLK sector (on-chip) | $7 \mu \mathrm{H}$ (off-chip) | 57 CLK sectors, each with 2 inductors from $0.3-2.5 \mathrm{nH}$ (on-chip) | $\sim 7 \mathrm{nH}$ (off-chip) | None |
| Area | N.R. | N.R. | N.R. | $0.04 \mathrm{~mm}^{2}$ (not including inductor) | $0.019 \mathrm{~mm}^{2}$ ($6.2 \%$ of $A_{\text {LOAD }}$) |
| Power reduction | 20\% | 36\% | 25-33\% | 32-47\% | 26.2-55.6\% (average = 41.8\%) |
| Frequency range with power reduction | $2.6-4.6 \mathrm{GHz}$ | $0.98 \mathrm{MHz}^{*}$ | $2.5-5 \mathrm{GHz}$ | $10-152 \mathrm{MHz}$ | $1 \mathrm{MHz}-2 \mathrm{GHz}$ |
| Dynamic frequency range | $1.8 \times$ | N/A | $2 \times$ | $15.2 \times$ | $2{,}000 \times$ |
| Voltage range | 1 V | $0.37 \mathrm{~V}^{**}$ | $0.75-1.08 \mathrm{~V}$ ($1.44 \times$) | $0.7-1.2 \mathrm{~V}$ ($1.7 \times$) | $0.4^{**}-1 \mathrm{~V}$ ($2.5 \times$) |
| Duty-cycle control | No | No | Limited | Yes | Yes |
| Clock capacitance | 7.5 pF | 10 pF | N.R. | N.R. | 15 pF |
| CLK granularity | Global only | Global only | Global only | Global only | Global + local |

\* Low frequency range not reported. \*\* Near-threshold operation. N.R. = Not Reported.
Figure 6.6: Comparison of the proposed adiabatic clocking scheme versus resonant clocking implementations.


Figure 6.7: Micrograph of the fabricated chip.

## Part III

## Fine-Grain Power Management

## Chapter 7

## A Recursive Digital Low-Dropout Voltage

## Regulator

### 7.1 Introduction

Modern sub- or near-threshold SoC designs feature multiple power domains to dynamically track the maximum energy efficiency point (62 mV to 0.45 V [101, 6, 102, 103, 104, 105, 106, 107]) in response to application demands. Analog low drop-out (LDO) regulators [108, 109, 110, 111, 112, 113, 114, 91] can generate such voltages in a small area with rapid response times (e.g., $T_{R}=0.65 \mathrm{~ns}$ [91]). However, the input voltage, $V_{i n}$, is typically brought on-chip via either a high-efficiency switching DC-DC converter or an external harvesting source, both with low-voltage sub- or near-threshold outputs (e.g., 0.5 V), where analog LDOs have difficulty operating due to low voltage headroom. On the other hand, digital LDOs (DLDOs) [115, 116, 117, 118, 119, 120, 121, 122, 123, 124], which replace a single saturated PMOS power transistor with an array of PMOS power transistors operating in the linear region, can operate down to 0.5 V since less headroom is required. Most switch-array based DLDOs rely on an integral controller to linearly search (via a 1-bit ADC) for the PMOS array conductance that


Figure 7.1: (a) Block-level diagram of the proposed RLDO. (b) Illustrative $V_{\text {out }}$ response to a 0-to-$V_{\text {ref }}$ step when the rate of change of $V_{\text {out }}$ is much faster than the clock frequency, $f_{C L K}$, to explicitly show the binary search process.
realizes the nearest output voltage, $V_{\text {out }}$, to the desired target, $V_{\text {ref }}$. For $N$-bit control, linear search can take up to $2^{N}$ cycles. To expedite conversion, prior work has suggested adding a proportional term via a multi-bit ADC [120, 121]. However, the achievable response times are $44-4000 \mathrm{~ns}$ [120, 121], which is not sufficient for many digital loads.

Intuitively, binary search, illustrated in Fig. 7.1a, can find the required array conductance in exponentially shorter time, i.e., $O(N)$ cycles, and therefore can enable exponentially faster response, $T_{R}$, and settling, $T_{S}$, times (Fig. 7.1b) than linear search. Unfortunately, a DLDO employing a binary search algorithm suffers from large staircase overshoots/undershoots during the N-step binary search, along with large steady-state error, as will be discussed in Sections 7.3 and 7.5 . These challenges rendered binary-search control impractical as a main regulation scheme ever since the first introduction of switch-array DLDOs in [115, 116].

To enable a pragmatic binary-search-based DLDO [81], the large staircase overshoots and undershoots during binary-search steps are avoided by operating from a clock whose period is shorter than the time constant of the load. Since doing so nominally renders a DLDO unstable due to erroneous successive decisions, stability is obtained in the proposed design via a variable-coefficient proportional-derivative compensation scheme described in Section 7.3. To eliminate steady-state errors, a hysteretic PWM control scheme is proposed in Section 7.4 that also enables sub-LSB load-current regulation. Additionally, loop-interruption logic is implemented in Section 7.5 to avoid the large overshoot/undershoot that can result when a sudden $I_{L}$ or $\Delta V_{\text {in }}$ step change occurs in the middle of the binary search (after deciding the first few most-significant bits (MSBs)).

### 7.2 Successive-Approximation Digital LDO Topology and Operation

### 7.2.1 SAR LDO Architecture

A conceptual 7-bit binary-search DLDO is illustrated in Fig. 7.1a. The array of $2^{N}$ equal-size PMOS fingers in barrel-based linear-search DLDOs [116, 125] is replaced with $N$ binary-weighted PMOS switches (PMOS DAC), with a total conductance $G$, controlled via a SAR register. An additional switch of weight equal to the LSB is included for duty control (described later in Section 7.4).

## $V_{r e f}$-Step Transient Response

Figure 7.1b shows the transient $V_{\text {out }}$ response of representative 3-bit linear-search and binary-search DLDOs to a 0-to-$V_{\text {ref }}$ input reference step (only 3 bits are shown for illustrative simplicity). For illustration purposes only, $f_{C L K}$ is set to be slower than the rate of change of $V_{\text {out }}$ in this example to allow enough time for $V_{\text {out }}$ to settle before each bit decision is taken. Here, the PMOS array is initially turned off. At the first positive clock edge after the $V_{\text {ref }}$ step increase, the comparator output goes high, and hence under binary search the MSB switch is turned on. As a result, $V_{\text {out }}$ starts to increase from zero initial voltage until it settles at the corresponding $G / 2$ voltage level. At the following positive clock edge, $V_{\text {out }}$ is still less than $V_{\text {ref }}$; therefore, the next MSB switch in the array is turned on, which increases $V_{\text {out }}$ to the $\frac{3}{4} G$ voltage level. At the third clock cycle, $V_{\text {out }}$ is larger than $V_{\text {ref }}$, and therefore the SAR register turns off the second bit, $B(1)$, while simultaneously turning on the LSB.
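The 3-bit sequence described above can be reproduced with a minimal, idealized Python sketch. It assumes that $V_{\text {out }}$ settles fully between clock edges and increases monotonically with the switch code, so that the comparison of $V_{\text {out }}$ against $V_{\text {ref }}$ reduces to comparing the current code against a hypothetical `target_code`. The function names and this abstraction are illustrative assumptions, not the implemented controller.

```python
def sar_trace(target_code, n_bits):
    """Codes applied on successive clock edges by an idealized binary search.

    `code > target_code` stands in for the comparator reporting Vout > Vref.
    """
    code = 0
    trace = []
    for i in range(n_bits):
        bit = n_bits - 1 - i
        if i > 0 and code > target_code:
            code &= ~(1 << (bit + 1))  # previous bit overshot: turn it off
        code |= 1 << bit               # turn on the next-lower bit
        trace.append(code)
    return trace


def linear_trace(target_code):
    """Codes applied by an idealized linear search stepping one LSB per cycle."""
    return list(range(1, target_code + 1))


if __name__ == "__main__":
    n_bits = 3
    for code in sar_trace(5, n_bits):
        print(code, code / 2 ** n_bits)  # code and fraction of total conductance G
    print(len(sar_trace(127, 7)), len(linear_trace(127)))
```

For a hypothetical target code of 5 in the 3-bit case, the trace visits $G / 2$, $\frac{3}{4} G$, and $\frac{5}{8} G$, matching the MSB-on, next-bit-on, then $B(1)$-off/LSB-on sequence in the text.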

As illustrated in Fig. 7.1b, a binary-search DLDO achieves a $2^{N} / N$ times faster settling time than baseline linear-search DLDOs. Furthermore, the SAR controller requires only $N$ DFFs instead of the $2^{N}$ DFFs of a barrel shifter, reducing control area by $2^{N} / N$. Along with a $2^{N} / N$ reduction in the number of cycles to reach $V_{r e f}$, clock power is thus reduced to a fraction $N^{2} / 2^{2 N}$ of the baseline. Importantly, a binary-search DLDO does not suffer from limit-cycle oscillations as in baseline linear-search DLDOs, enabling $1 / 2^{N}$ lower quiescent current, $I_{Q}$, and hence higher current efficiency, where the entire SAR controller is clock-gated and only a single DFF is clocked during duty control, as discussed later in Section 7.4. Prior schemes [124, 122] have been proposed to reduce the $I_{Q}$ of linear search; however, they come at a reduced DC accuracy, as will be discussed shortly. On the other hand, to reduce area and quiescent power in a linear-search DLDO, it is in principle possible to replace the $2^{N}$ barrel shifter by an $N$-bit up/down counter driving binary-sized PMOS fingers. While this modified architecture reduces the number of required DFFs, and hence quiescent power (by $2^{N} / N$), it results in significant periodic noise in the on-chip supply lines due to limit-cycle oscillation of the binary-weighted, as opposed to unary-sized, PMOS fingers.
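To give a feel for the magnitude of these scaling factors, the short Python sketch below simply evaluates the expressions quoted above for a given resolution $N$. The function name and dictionary keys are hypothetical, and the numbers are the idealized ratios only, not measured results.

```python
def search_scaling(n_bits):
    """Idealized binary-search versus linear-search ratios for an n_bits controller."""
    linear_cycles = 2 ** n_bits
    return {
        "dffs_linear": linear_cycles,       # barrel-shifter flip-flops
        "dffs_binary": n_bits,              # SAR register flip-flops
        "cycle_ratio": linear_cycles / n_bits,                      # 2^N / N
        "clock_power_fraction": n_bits ** 2 / linear_cycles ** 2,   # N^2 / 2^(2N)
    }


if __name__ == "__main__":
    for n in (3, 5, 7):
        print(n, search_scaling(n))
```

For the 7-bit case considered in this chapter, the cycle ratio evaluates to $128 / 7 \approx 18.3$.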

## $I_{L, \text { max }}$-Step Transient Response

Figure 7.2a illustrates the transient $V_{\text {out }}$ response of representative 7-bit linear-search and binary-search DLDOs to a full-range 0 -to- $I_{L, \max }$ load step. Compared to a baseline linear-search DLDO which turns on only a single finger (with an LSB conductance of $G / 2^{7}$ ) after the load step, the SAR architecture turns on half of the total array conductance, $G / 2$, after the first clock cycle. In the second clock cycle, the RLDO turns on an additional quarter of the array conductance, $G / 4$, and so on in a binary-subsiding manner. The proposed SAR architecture therefore requires only $N$ clock cycles to respond to a full load step compared to $2^{N}$ in a baseline linear-search DLDO, enabling a much smaller response time and $\Delta V_{\text {droop }}$. However, in reality, the output voltage droop,


Figure 7.2: Transient $V_{\text {out }}$ response of 7-bit linear-search and binary-search DLDOs to a 0-to-$I_{L, \max }$ load step. (a) Close-in. (b) Far-out. Both LDOs have the same total array conductance, $G$, where $G \Delta V_{\text {drop-out }}$ matches $I_{L, \max }$.
$\Delta V_{\text {droop }}$, increases the finger current to $G / 2^{N}\left(\Delta V_{\text {drop-out }}+\Delta V_{\text {droop }}\right)$ due to linear-mode operation of the PMOS transistors, and hence the DLDO response time is a fraction, $\alpha<1$, of the simplified current-source case: $T_{R, D L D O}=\alpha_{1} 2^{N} T_{c l k} ; T_{R, R L D O}=\alpha_{2} N T_{c l k}$, as shown in Fig. 7.2b.

### 7.2.2 Performance Comparison: Speed-Power Trade-off Improvement via SAR Control

One of the main design objectives of a DLDO is to minimize the output voltage droop in response to a sudden load step. Nominally, consuming $\mathrm{K} \times$ higher $I_{Q}$ (e.g., via a higher $f_{C L K}$ in a DLDO) enables $\mathrm{K} \times$ faster $T_{R}$, and hence the speed-power product is fixed for a given architecture. Therefore, the product of $T_{R}$ and normalized quiescent current is employed as a figure-of-merit


Figure 7.3: FOM of a linear-search DLDO and a binary-search RLDO normalized to the analog LDO FOM for the same process technology, $I_{Q} / \Delta I_{L}$, and $C_{\text {out }}$ versus the required resolution $N$.
(FOM) [8] for a normalized comparison among various LDOs:

$$
\begin{equation*}
F O M=T_{R} \hat{I}_{Q}=\frac{C_{o u t} \Delta V_{\text {droop }}}{I_{L, \max }-I_{L, \min }} \frac{I_{Q}}{I_{L, \max }-I_{L, \min }}, \tag{7.1}
\end{equation*}
$$

where the load step test is performed between $I_{L, \min }$ and $I_{L, \max }$ with a load rise/fall time less than $T_{R} / 10$, and $I_{Q}$ is the quiescent current incurred during the periodic load swing test rather than the best-case $I_{Q}$. A binary-search DLDO inherently achieves a $2^{2 N} / N$ times smaller (i.e., better) FOM than a baseline linear-search DLDO due to the $2^{N} / N$ faster response time and the reduced quiescent power afforded by the $1 / 2^{N}$ lower number of switching DFFs at steady-state. Figure 7.3 illustrates the FOMs for an analog LDO, and linear- and binary-search DLDOs.
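Equation (7.1) can be evaluated directly, as in the following Python sketch. The function name and all numerical values in the example are hypothetical placeholders chosen only to show the units; they are not measurements from this work.

```python
def ldo_fom(c_out, dv_droop, i_q, i_l_max, i_l_min):
    """FOM = T_R * (I_Q / delta_I_L) following Eq. (7.1); result in seconds."""
    delta_i = i_l_max - i_l_min
    t_r = c_out * dv_droop / delta_i  # response time T_R
    i_q_norm = i_q / delta_i          # normalized quiescent current
    return t_r * i_q_norm


if __name__ == "__main__":
    # Hypothetical example: 1 nF, 50 mV droop, 10 uA I_Q, 0-to-2 mA load step.
    fom = ldo_fom(c_out=1e-9, dv_droop=0.05, i_q=10e-6, i_l_max=2e-3, i_l_min=0.0)
    print(fom)  # seconds; smaller is better
```

Because both factors are normalized by the same load step, the FOM has units of time, and a smaller value indicates a better speed-power trade-off.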

[^4]
### 7.3 Variable Coefficients Proportional-Derivative Compensation

In order to filter out the unacceptably large staircase overshoots and undershoots of the SAR conversion steps in Fig. 7.1b, it is necessary to take the SAR decisions at a faster rate than the rate of change of $V_{\text {out }}$. However, this can lead to erroneous successive decisions during the SAR conversion and hence an unstable response. This is why prior-art binary-search architectures were not practically feasible [126, 127]; instead, prior-art SAR switching was utilized only as a mid sub-array helper DLDO [128]. In order to enable a pragmatic SAR-based LDO, a proportional-derivative (PD) compensation scheme is proposed to establish a multi-rate fast-slow control loop that observes (samples) $V_{\text {out }}$ at a fast clock speed, $f_{C L K}$, to eliminate the SAR conversion overshoots and undershoots, while allowing the integrator to accumulate at a rate close to the output pole frequency, $f_{c} \sim f_{L}$, to avoid instability.

### 7.3.1 Stability Analysis of DLDOs using a Bode Diagram Approach

Figure 7.4a illustrates a piece-wise-linear small-signal AC model of a binary-search DLDO without PD compensation in the Z-domain. The initial load conductance, $G_{L}=1 / R_{L}$ and the initial (i.e., prior to the $i^{\text {th }}$ iteration) PMOS array conductance, $G_{p}(i-1)$, in Fig. 7.4a establish the DC operating point around which the AC response to the input small-signal disturbance, $\Delta V_{r e f}$, is evaluated.

The comparator samples and quantizes the input error signal $e(t)=V_{\text {ref }}-V_{\text {out }}(t)$. At the beginning of the conversion, the SAR controller has the highest gain (via switching the MSB conductance) to facilitate a rapid response time via a large loop bandwidth. As the SAR algorithm converges, the step size at the $i^{\text {th }}$ iteration decreases (i.e., $M(i) \times \mathrm{LSB}=G / 2^{i}$), and thus the gain, $K(i)$, decreases in a binary subsiding manner, until the gain reaches the LSB value. For the purpose of stability analysis, the SAR register is assumed to accumulate the instantaneous error, $M(i) e[k]$ rather than $M(i) e[i]$, to determine the number of turned-on PMOS fingers, $B[k]$.

The zero-order-hold equivalent of the output RC network, comprising $G_{p}(i-1), R_{L}$, and $C_{\text {out }}$, can be found as $\left(1-z_{L}\right) /\left(z-z_{L}\right)$, where $z_{L}=e^{-f_{L} / f_{C L K}}$ is the output pole, which is determined by the ratio between the equivalent output-load frequency, $f_{L}=\left(G_{L}+G_{p}(i-1)\right) / C_{\text {out }}$, and the DLDO sampling clock, $f_{C L K}$. Therefore, the open-loop transfer function of such a second-order feedback control loop is given by

$$
\begin{equation*}
G(z)=\frac{K(i)\left(1-z_{L}\right) z}{(z-1)\left(z-z_{L}\right)} \tag{7.2}
\end{equation*}
$$

and hence $G(z)$ has two poles: the loop integrator pole on the unit circle ($z=1$), and the output pole, $z_{L}$.
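The output-pole expression and the zero-order-hold model of the RC network can be checked numerically with a small Python sketch. The difference equation $y[k+1]=z_{L} y[k]+\left(1-z_{L}\right) u[k]$ is the time-domain form of $\left(1-z_{L}\right) /\left(z-z_{L}\right)$; the function names and the component values in the example are hypothetical, and $f_{L}$ is computed exactly as defined in the text.

```python
import math


def output_pole(g_load, g_pmos, c_out, f_clk):
    """z_L = exp(-f_L / f_CLK), with f_L = (G_L + G_p) / C_out as defined in the text."""
    f_l = (g_load + g_pmos) / c_out
    return math.exp(-f_l / f_clk)


def rc_step_response(z_l, n_cycles):
    """Unit-step response of (1 - z_L) / (z - z_L), one value per clock cycle."""
    y = 0.0
    out = []
    for _ in range(n_cycles):
        y = z_l * y + (1.0 - z_l) * 1.0  # y[k+1] = z_L*y[k] + (1 - z_L)*u[k]
        out.append(y)
    return out


if __name__ == "__main__":
    # Hypothetical values: 1 mS load, 1 mS array conductance, 1 nF, 10 MHz clock.
    z_l = output_pole(1e-3, 1e-3, 1e-9, 10e6)
    print(z_l)
    print(rc_step_response(z_l, 5))
```

A faster clock relative to $f_{L}$ pushes $z_{L}$ towards the integrator pole at $z=1$, so the output moves by a smaller fraction of its final value in each cycle.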

It can be shown that the open-loop gain $G(s)$ of the corresponding continuous-time system, obtained using the impulse-invariance transformation [129] of the digital LDO in (7.2), is given by

$$
\begin{equation*}
G(s)=\frac{\omega_{n}^{2}}{s\left(s+2 \eta \omega_{n}\right)} . \tag{7.3}
\end{equation*}
$$

Therefore, the open-loop gain $G(s)$ contains two poles, where the Z-domain poles, $z=1$ and $z_{L}$, map to $s=0$ and $f_{L}$, respectively, in the S-domain. Figure 7.4b illustrates the Bode diagram of a DLDO, where the integrator pole $(s=0)$ asymptote intersects the $0-\mathrm{dB}$ axis at $f_{I}=\omega_{n} /(2 \eta)$. It can be shown that $f_{I}=\beta f_{C L K} \ln (M(i) \times \mathrm{LSB}+1)$, where $\beta$ is a proportionality factor less than unity. Therefore, for a given output load frequency, $f_{L}$, increasing either the sampling frequency, $f_{C L K}$, or the array conductance step, $M$, increases $f_{I}$, which shifts the magnitude plot upwards, as shown in Fig. 7.4b. This boosts the unity-gain frequency or loop bandwidth, $\omega_{G C}=\sqrt{f_{L} f_{I}}$, and hence enables a faster response time, but at the cost of a reduced phase margin (PM), and hence reduced stability, as set by $P M=90^{\circ}-\tan ^{-1} \sqrt{\frac{f_{I}}{f_{L}}}$. This speed-stability trade-off sets the allowable values for the design variables $f_{C L K}$ and $M$, and hence the upper bound on the achievable response time for a given design. The speed-stability trade-off becomes tighter with reduced current values: at a fixed $f_{C L K}$ and $M$, $f_{L}$ becomes more dominant with smaller loads, which increases the relative separation between the two poles ( $f_{L}$ and $f_{I}$ ) and hence reduces the phase margin and eventually results in an oscillatory response at light loads, as in Fig. 7.5a.
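The trade-off can be made concrete by evaluating the two expressions above, $\omega_{G C}=\sqrt{f_{L} f_{I}}$ and $P M=90^{\circ}-\tan ^{-1} \sqrt{f_{I} / f_{L}}$, over a sweep of $f_{I} / f_{L}$. This is a minimal sketch; the function names and the swept ratios are hypothetical.

```python
import math

def loop_bandwidth(f_l: float, f_i: float) -> float:
    """Unity-gain frequency, omega_GC = sqrt(f_L * f_I)."""
    return math.sqrt(f_l * f_i)

def phase_margin_deg(f_l: float, f_i: float) -> float:
    """Phase margin, PM = 90 deg - atan(sqrt(f_I / f_L))."""
    return 90.0 - math.degrees(math.atan(math.sqrt(f_i / f_l)))

# Assumed sweep: f_L normalized to 1, f_I / f_L from 0.1 to 100.
for ratio in (0.1, 1.0, 10.0, 100.0):
    bw = loop_bandwidth(1.0, ratio)
    pm = phase_margin_deg(1.0, ratio)
    print(f"f_I/f_L = {ratio:6.1f}  bandwidth = {bw:6.2f}  PM = {pm:5.1f} deg")
```

In this sweep the bandwidth grows with $f_{I}$ while the phase margin falls monotonically, passing through $45^{\circ}$ at $f_{I}=f_{L}$.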


Figure 7.4: RLDO model. (a) Small-signal AC model. (b) Bode diagram. (c) RLDO with PD controller. The SAR controller acts as a variable-gain discrete-time integrator.

In theory, loop stability can be ensured by guaranteeing enough phase margin at each conversion step, $i$, in the piece-wise linear model. To maintain a fixed integrator asymptote crossing $\left(f_{I}\right)$ and ensure stable operation, $f_{C L K}$ should be linearly reduced from the LSB stable clock rate, as $f_{C L K, L S B} /(N-i+1)$, at every iteration $i$. Unfortunately, in order to avoid significant overshoots and undershoots during the initial SAR conversion steps, $f_{C L K}$ should be scaled exponentially the other way around in a binary increasing manner, i.e., $2^{i} f_{C L K, L S B}$, such that $f_{C L K, M S B}$ at the first iteration is faster than the MSB slew rate, $\mathrm{MSB} \times \Delta V_{\text {drop-out }} / C_{\text {out }} \times 1 / \Delta V_{o v}$, where $\Delta V_{o v}$ is the allowed overshoot magnitude, or $1 /\left(2 T_{R}\right)$, for $\Delta V_{o v}=\Delta V_{d r o o p}$. Therefore, a DLDO incorporating binary search is either inherently unstable or provides an output with unacceptably large overshoots and undershoots.

### 7.3.2 Adaptive Zero Insertion through Variable-Coefficients PD Compensation

In order to enable a pragmatic SAR-based LDO, a zero is added at $f_{L}$ to the open-loop transfer function in (7.3) through an adaptive Proportional-Derivative compensation scheme, converting the LDO into a first order system.

## Proportional Derivative Control Law and Action

Intuitively, the barrel shifter in a linear-search DLDO should ideally be locked once the output voltage reverses the direction of its slope and starts to increase towards $V_{\text {ref }}$ (e.g., after turning on three fingers in Fig. 7.6) to avoid overshoot. In other words, the loop integrator should be incremented ( +1 state) only when $V_{\text {out }}$ has a negative slope (derivative term) while $V_{\text {out }}$ is less than $V_{\text {ref }}$ (proportional term). Similarly, the loop integrator should be decremented ( -1 state) only when $V_{\text {out }}$ is trending upwards, i.e., with a positive slope, while $V_{\text {out }}$ is larger than $V_{\text {ref }}$. Otherwise, the loop integrator value should be kept fixed ( 0 state). This behavior can be described by the control law of the proposed PD compensator as:
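$$
\begin{equation*}
\begin{array}{ll}
\text{if } \left(V_{\text {out }}[k]<V_{\text {ref }}\right) \,\&\,\left(d V_{\text {out }} / d t<0\right) & \text{increment;} \\
\text{elseif } \left(V_{\text {out }}[k]>V_{\text {ref }}\right) \,\&\,\left(d V_{\text {out }} / d t>0\right) & \text{decrement;}
\end{array} \tag{7.4}
\end{equation*}
$$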

The illustrated proportional-term logic in (7.4) can be implemented through the quantized error voltage, $e[k]=\left(V_{\text {ref }}-V_{\text {out }}[k]\right)$, which is +1 when $V_{\text {ref }}>V_{\text {out }}$, and -1 when $V_{\text {out }}>V_{\text {ref }}$. The derivative term logic in (7.4) can be evaluated through the difference $(e[k]-e[k-1])$ which is equivalent to $\Delta V_{\text {out }}[k]=\left(V_{\text {out }}[k-1]-V_{\text {out }}[k]\right)$. Therefore, the PD control law in (7.4) can be implemented through the addition of the aforementioned two terms as $u[k]=K_{P} e[k]+K_{D} \Delta V_{\text {out }}[k]$,


Figure 7.5: (a) Transient $V_{\text {out }}$ simulations of a 7-bit RLDO with the proposed PD compensation, and a 7-bit DLDO at peak current $R_{L}=1 \Omega$ and at light current $R_{L}=25 \Omega$ ($V_{\text {in }}=2 \mathrm{~V}$, $V_{\text {ref }}=1 \mathrm{~V}$, $G=5 \Omega^{-1}$, $C_{\text {out }}=1 /(2 \pi)$). (b) Simulations of the PD-compensated RLDO at $R_{L}=100 \Omega$ and $R_{L}=1000 \Omega$.

Table 7.1: PD Control Action

| P term | D term | PD output |
| :---: | :---: | :---: |
| $\frac{1}{2}\left(V_{\text {ref }}-V_{\text {out }}[k]\right)^{*}$ | $\frac{1}{2}\left(V_{\text {out }}[k-1]-V_{\text {out }}[k]\right)^{*}$ | $u[k]$ |
| $+1 / 2$ | $+1 / 2$ | +1 |
| $+1 / 2$ | $-1 / 2$ | 0 |
| $-1 / 2$ | $+1 / 2$ | 0 |
| $-1 / 2$ | $-1 / 2$ | -1 |

where $u[k]$ is the PD output that is provided to the loop integrator (i.e., barrel shifter or SAR register), $K_{P}$ and $K_{D}$ are the proportional and derivative coefficients, and $\Delta V_{\text {out }}[k]$ is the derivative term. Table 7.1 illustrates the PD output across the possible values of the proportional and derivative terms.
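The control action in Table 7.1 can be sketched in a few lines. The example below quantizes the P and D terms individually to $\pm 1 / 2$ and adds them, as in the table. The helper names are hypothetical, and the handling of an exactly zero input (mapped to $-1 / 2$ here) is an assumption, since a real comparator resolves that case arbitrarily.

```python
def sign_half(x: float) -> float:
    """1-bit quantizer with weight 1/2: +1/2 if x > 0, otherwise -1/2.

    Mapping x == 0 to -1/2 is an arbitrary choice made for this sketch.
    """
    return 0.5 if x > 0 else -0.5

def pd_output(v_ref: float, v_out_k: float, v_out_km1: float) -> int:
    """PD output u[k] in {-1, 0, +1}, following Table 7.1."""
    p_term = sign_half(v_ref - v_out_k)        # proportional term
    d_term = sign_half(v_out_km1 - v_out_k)    # derivative term
    return int(p_term + d_term)

# Below V_ref and still falling: increment the integrator.
print(pd_output(v_ref=1.0, v_out_k=0.90, v_out_km1=0.95))
# Below V_ref but already rising: hold.
print(pd_output(v_ref=1.0, v_out_k=0.90, v_out_km1=0.85))
```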

## Stability Improvement with PD Compensation

To illustrate the effect of the proposed PD compensation on the LDO frequency response, PD compensation is incorporated in the small-signal model in Fig. 7.4c. Using a backward-difference approximation to the differentiation operator, $d e(t) / d t$, the discrete equivalent of a continuous-time PD compensator, $K_{P} e(t)+K_{D_{C}} d e(t) / d t$, can be found as $K_{P} e[k]+K_{D_{C}}(e[k]-e[k-1]) / T_{C L K}$. Therefore, the PD compensation inserts a zero at $-K_{P} / K_{D_{C}}$ (or $-K_{P} / K_{D} \times f_{C L K}$) to the corresponding continuous-time open-loop transfer function in (7.3). Unfortunately, conventional series PD compensators have fixed coefficients ( $K_{P}$ and $K_{D}$ ) at run time. Consequently, the added zero cancels the phase lag of the output pole $f_{L}=2 \eta \omega_{n}$ only when $-K_{P} / K_{D_{C}}$ is more dominant than $f_{L}$ (high currents), limiting the achievable $I_{L}$ dynamic range. On the other hand, in this work, each P and D term is individually quantized before the addition, enabling adaptation of the inserted zero with the output pole while ensuring the P and D terms have equal weights in the final output.

To illustrate the difference in the control action between the proposed and the conventional PD compensator, consider the case when $V_{\text {out }}$ is much less than $V_{\text {ref }}$, yet is slowly approaching $V_{\text {ref }}$. In this case, the conventional PD outputs +1 , since the D term, $\Delta V_{\text {out }}[k]$, is negative with a magnitude much less than the positive P term, $\left(V_{\text {ref }}-V_{\text {out }}[k]\right)$. On the other hand, the proposed PD output is zero, since the P and D terms are quantized individually before the addition, and hence the P term is +1 while the quantized D term is -1 .
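The case described above can be reproduced numerically. In this minimal sketch, the conventional compensator is modeled as summing the analog P and D terms and then taking the sign, while the proposed one quantizes each term before the addition. Both functions, the unit coefficients, and the voltages are assumptions chosen only to illustrate the scenario in the text.

```python
def sgn(x: float) -> int:
    """Three-level sign: -1, 0, or +1."""
    return (x > 0) - (x < 0)

def conventional_pd(v_ref, v_k, v_km1, kp=1.0, kd=1.0):
    """Add the analog P and D terms first, then quantize the sum."""
    return sgn(kp * (v_ref - v_k) + kd * (v_km1 - v_k))

def proposed_pd(v_ref, v_k, v_km1):
    """Quantize the P and D terms individually, then add with weight 1/2."""
    return int((sgn(v_ref - v_k) + sgn(v_km1 - v_k)) / 2)

# V_out far below V_ref (0.50 V vs 1.00 V) but slowly rising (0.49 V -> 0.50 V).
print(conventional_pd(1.0, 0.50, 0.49))  # large P term dominates the sum
print(proposed_pd(1.0, 0.50, 0.49))      # equal-weight terms cancel
```

With these values the conventional form keeps incrementing the integrator, whereas the individually quantized form holds, which is the cycle-skipping behavior the proposed compensator relies on.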

To account for the quantization effect, an input-dependent (i.e., $e[k]$-dependent) quantization gain is added to the P and D coefficients such that $K_{P}=k_{p} \times k_{Q p}(e[k])$ where $\left|k_{Q p} \times e[k]\right|$ is 1 , and similarly, $K_{D}=k_{d} \times k_{Q d}(e[k])$ where $\left|k_{Q d} \times \Delta e[k]\right|$ is 1 . Therefore, $k_{Q p}$ is simply $1 /\left|v_{\text {out }}[k]\right|$, where $V_{\text {ref }}$ is set to zero under small-signal operation. The D-term difference $e[k]-e[k-1]$ can be evaluated by $\left(V_{\text {out }}[k-1]-V_{\text {out }}[k]\right)$ which is equivalent to $\left(G_{p}(i)+G_{L}\right) v_{\text {out }}[k] / C_{\text {out }} \times T_{C L K}$ or essentially $f_{L} / f_{C L K} \times v_{\text {out }}[k]$, for sufficiently small values of $T_{C L K}$. Therefore, the D-term quantization gain $k_{Q d}$ becomes $f_{C L K} /\left(f_{L} v_{\text {out }}[k]\right)$ so that $\left|k_{Q d} \times \Delta e[k]\right|$ is 1 . Consequently, the resulting zero $-K_{P} / K_{D_{C}}$ in (7.3) becomes $-f_{L}$, which perfectly cancels the output pole and enables a single-pole system with phase margin of $90^{\circ}$, as shown in Fig. 7.4b, irrespective of $C_{\text {out }}, R_{L}, M(i)$, and $f_{C L K}$. Figure 7.5b verifies the proposed PD compensation efficacy in realizing stable operation irrespective of $I_{L}$. In summary, inclusion of the third idle state effectively implements cycle-skipping and adapts the rate at which the integrator updates its value, $f_{c}=f_{C L K} / m$ for integer $m$, with the output rate, $f_{L}$, to maintain the output pole, $e^{-f_{L} / f_{c}}$, inside the unit circle.

Figure 7.6: DLDO transient $V_{\text {out }}$ response with and without the proposed PD compensator.

### 7.4 Sub-LSB Hysteretic PWM Control

### 7.4.1 Minimum Current Limit of Linear-Search Based DLDOs

## DC Accuracy Limitation

In a linear-search DLDO, limit cycling modulates the duty-cycle of the $n$ oscillating fingers at $f_{C L K} /(2 n)$ to maintain the average $V_{\text {out }}$ close to $V_{\text {ref }}$. Unfortunately, such duty-cycle modulation fails to provide the desired $V_{\text {out }}$ level, as the current of cycling on/off LSBs becomes comparable to $I_{L}$ and hence gives a more pronounced $V_{\text {out }}$ steady-state error, especially at large drop-out voltages. For instance, in a 7-bit barrel-based DLDO, the steady-state error exceeds $\pm 1 \% V_{\text {ref }}$ when $I_{L}$ is $2^{2.6}$ below $I_{L, \max }$ for $V_{\text {ref }}=V_{\text {in }} / 2$ (Fig. 7.7a).

## $V_{\text {out }}$ Peak-to-Peak Ripple Limitation

Limit-cycle oscillations result in an output voltage ripple, $\Delta V_{\text {out }, p-p}$, proportional to the number of limit-cycling fingers, $n$. As $I_{L}$ is reduced, the output peak-to-peak ripple increases since the effect of the limit-cycling fingers on $V_{\text {out }}$ is more pronounced at lighter loads, as illustrated in Fig. 7.7b. $\Delta V_{\text {out }, p-p}$ can be reduced by increasing the DLDO operating frequency, $f_{C L K}$, and hence the DLDO output ripple frequency $f_{C L K} /(2 n)$, beyond the output RC network corner $f_{L}$, as in Fig. 7.7b. Unfortunately, the higher $f_{C L K}$ results in a lower damping factor $\eta$ and eventually an oscillatory response, as discussed in Section 7.3.1. For a 7-bit DLDO, when $V_{\text {ref }}=0.9 V_{\text {in }}$, the load current range with a peak-to-peak ripple below 50 mV is limited to $2^{6.7}$.

Figure 7.7: Setting-limits of digital LDOs minimum current. (a) Simulated steady-state error versus $R_{L}$ at $f_{C L K}=10 \times f_{L}$ and (b) simulated steady-state $V_{\text {out }}$ ripple $\Delta V_{\text {out }, p-p}$ versus $f_{C L K} / f_{L}$ of a 7-bit shifter-based DLDO with $V_{\text {ref }}=V_{i n} / 2$.

Therefore, the LDO minimum $I_{L}$ is typically limited by the acceptable steady-state error level at large $\Delta V_{\text {drop-out }}$, and by the allowable output voltage ripple at small $\Delta V_{\text {drop-out }}$, both of which are caused by the limit-cycling $\operatorname{LSB}(\mathrm{s})$. Hence the resolution $N$, which determines both, defines the achievable dynamic range $I_{L, \max } / I_{L, \min }$. Unfortunately, increasing the resolution $N$ comes with a worse transient FOM, as in Fig. 7.3.

### 7.4.2 Minimum Current Limit in a Binary-Search DLDO

In a binary-search DLDO, on the other hand, $V_{\text {out }}$ is within one $I_{L} \times \mathrm{LSB}$ of the desired target, $V_{r e f}$, after the SAR conversion. Since the RLDO does not exhibit limit-cycle oscillations, the PMOS array conductance $G_{p}$ converges to $G_{r e f} \pm \mathrm{LSB}$ at steady-state, where $G_{r e f}$ is the PMOS conductance that makes $V_{\text {out }}$ match $V_{\text {ref }}$ and the final $V_{\text {out }}$ value is $V_{\text {in }} \times G_{p} /\left(G_{p}+G_{L}\right)$. As a result, the worst-case error becomes $\sim \pm \frac{\text { LSB }}{G_{L}\left(1+G_{r e f} / G_{L}\right)} \times V_{\text {in }}$ or $\pm \frac{\text { LSB }}{G_{L}} \Delta V_{\text {drop-out }}$. As the load conductance, $G_{L}$ (i.e., the load current), is reduced or $\Delta V_{\text {drop-out }}$ is increased, the worst-case steady-state error increases. Therefore, the load dynamic range $I_{L, \max } / I_{L, \min }$ for a given percentage error becomes limited to (error-%) $\times 2^{N}$; e.g., an ( $N-6.6$ )-bit range for a $\pm 1 \% V_{r e f}$ error. Therefore, the second challenge of the SAR LDO architecture, after its inherent instability (Section 7.3), is the limited DC accuracy that made such SAR architecture previously impractical.
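The $(N-6.6)$-bit figure quoted above follows directly from taking $\log _{2}$ of (error-%) $\times 2^{N}$. The short sketch below reproduces that arithmetic; the function name is hypothetical and the 7-bit, $\pm 1 \%$ values are used only as an example.

```python
import math

def sar_dynamic_range_bits(n_bits: int, error_fraction: float) -> float:
    """log2 of the load dynamic range, I_L,max / I_L,min ~ error_fraction * 2**N."""
    return n_bits + math.log2(error_fraction)

# Example: 7-bit array with a +/-1% V_ref error budget.
bits_lost = -math.log2(0.01)
print(f"bits lost to the error budget: {bits_lost:.2f}")
print(f"usable range for N = 7: {sar_dynamic_range_bits(7, 0.01):.2f} bits")
```

Under this estimate a 7-bit SAR array on its own leaves well under one bit of usable load range at $\pm 1 \%$ accuracy, which motivates the sub-LSB control introduced next.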

### 7.4.3 Hysteretic PWM Control

To mitigate the accuracy problem as well as enable sub-LSB $I_{L}$ regulation, a redundant LSB switch is employed while the duty-ratio, $D$, of its gate voltage is modulated like a lossy switched-mode buck converter. Instead of the periodic $10-70 \mathrm{mV}$ ripple encountered during limit-cycling in prior DLDOs [125], a dual-bound $\left(V_{r e f, H}, V_{\text {ref }, L}\right)$ hysteretic PWM controller is used to generate the redundant LSB drive signal. Once the SAR controller brings $V_{\text {out }}$ to within the hysteretic window, the SAR controller is clock-gated and the PWM control is enabled. Here, the same SAR P-term comparators are reused, as will be discussed, to set or reset an SR latch that provides the gate voltage, $P W M$, of the redundant LSB, as in Fig. 7.8.

The average output value of a dual-bound hysteretic PWM control scheme is half of the hysteresis height, $V_{H} / 2$, as in Fig. 7.8b. Therefore, the steady-state error in a DLDO can be eliminated by setting the hysteresis bounds at $V_{\text {ref }, H}=V_{\text {ref }}+V_{H} / 2$ and $V_{\text {ref }, L}=V_{\text {ref }}-V_{H} / 2$, under a high sampling rate. Furthermore, at light loads, the minimum current supplied by the DLDO can go below the LSB finger current by $I_{s u b-L S B}=\frac{1}{l+1} I_{L S B}$ (Fig. 7.8b), with zero steady-state error and without the ripple exceeding $V_{H}$, unlike limit-cycle oscillation. Therefore, the achievable effective-LSB, and thus the dynamic range, is extended by $\log _{2}(l+1)$ bits. The PWM controller also limits the cycling switches to only a single PMOS finger, which reduces the steady-state quiescent power.
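The bounds and the range extension above reduce to simple arithmetic, sketched below. Here `l` is the integer appearing in the expression $I_{s u b-L S B}=I_{L S B} /(l+1)$; the function names and all numeric values are hypothetical and chosen only for illustration.

```python
import math

def hysteresis_bounds(v_ref: float, v_h: float) -> tuple[float, float]:
    """Return (V_ref,H, V_ref,L) centred on V_ref with total height V_H."""
    return v_ref + v_h / 2.0, v_ref - v_h / 2.0

def sub_lsb_current(i_lsb: float, l: int) -> float:
    """Minimum regulated current, I_sub-LSB = I_LSB / (l + 1)."""
    return i_lsb / (l + 1)

def extra_range_bits(l: int) -> float:
    """Dynamic-range extension in bits, log2(l + 1)."""
    return math.log2(l + 1)

# Assumed example: V_ref = 0.45 V, V_H = 20 mV, I_LSB = 20 uA, l = 3.
v_hi, v_lo = hysteresis_bounds(0.45, 0.020)
print(f"V_ref,H = {v_hi:.3f} V, V_ref,L = {v_lo:.3f} V")
print(f"I_sub-LSB = {sub_lsb_current(20e-6, 3) * 1e6:.1f} uA")
print(f"range extension = {extra_range_bits(3):.1f} bits")
```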



Figure 7.8: Hysteretic dual-bound controller. (a) Top-level schematic. (b) Operation.


Figure 7.9: Top-level state diagram of the proposed RLDO.

### 7.5 Circuit Implementation

The proposed RLDO incorporates the three techniques discussed above: SAR switching, adaptive PD compensation, and sub-LSB PWM control. As shown in Fig. 7.9, once the RLDO is enabled $(E N=1)$, the SAR control loop is initiated and the number of turned-on PMOS fingers is adjusted to coarsely set $V_{\text {out }}$ to $V_{\text {ref }}$. Once $E o C$ is asserted, the duty-cycle controller is enabled to perform sub-LSB fine regulation. If, during duty-control or SAR conversion steps, a sudden load-current $\Delta I_{L}$ or input-voltage $\Delta V_{\text {in }}$ step occurs that knocks $V_{\text {out }}$ outside the control hysteresis, upper- and lower-bound correction logic restarts the SAR operation by asserting $C N V$.

Table 7.2: Truth table of PD compensator

| $u[k]$ (from block diagram) | $e^{*}$ (logic input) | $\Delta V_{\text {out }}^{*}$ (logic input) | $I N C$ (logic output) | $D E C$ (logic output) |
| :---: | :---: | :---: | :---: | :---: |
| +1 | 1 | 1 | CLK | 0 |
| 0 | 1 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 | 0 |
| -1 | 0 | 0 | 0 | CLK |



Figure 7.10: Quantized gate-level implementation. (a) Equivalence of a clocked comparator to a quantized AND gate. (b) Quantized gate-level implementation of the PD compensator truth table in Table 7.2.

### 7.5.1 Proportional-Derivative Compensator Implementation

The PD compensator output can take on one of three values, either $+1,-1$, or 0 ; thus, two variables, $I N C$ and $D E C$, are used to represent the PD output state. Table 7.2 illustrates the relationship of these variables when $K_{P}$ and $K_{D}$ are set to $1 / 2$. The $I N C$ and $D E C$ signals are not static signals that take on a fixed value but rather act as pulsed signals, since they are used as clock inputs to the SAR logic in the RLDO.

A clocked sense amplifier can be considered as a differential quantized AND gate, as in Fig. 7.10a. Thus, when the quantized difference $\Delta V^{*}=\left[V_{+}-V_{-}\right]^{*}$ between the input analog signals is 1 , the sampling clock propagates through the positive output $O_{p}$ of the proposed gate and vice versa for the negative output $O_{n}$. Figure 7.10b illustrates the quantized gate-level implementation of the truth table of the PD compensator, where $e_{x}^{*}[k]$ represents the quantized error $\left[V_{\text {ref }, x}-V_{\text {out }}[k]\right]^{*}$. The PD essentially acts as an XNOR gate, where it gives a true (pulsing) output when the number of true inputs is even, as shown in Table 7.2.

Figure 7.11: Top-level schematic of the implemented PD compensator, including PWM comparators, derivative-term comparators, and bottom-plate sampling circuitry. Insets illustrate the 1st edge pass and DC correction logic.

The schematic of the implemented PD compensator is shown in Fig. 7.11. The two $P W M$ comparators, $\operatorname{COMP}_{H}$ and $\operatorname{COMP}_{L}$, implement the proportional term and thus provide the quantized errors $\left[V_{\text {ref }, H}-V_{\text {out }}[k]\right]^{*}$ and $\left[V_{\text {ref }, L}-V_{\text {out }}[k]\right]^{*}$, respectively. They establish a hysteresis where all the RLDO circuitry is disabled for minimal $I_{Q}$ and $V_{\text {out }}$ comes to a halt. The comparators $D I F F_{H}$ and $D I F F_{L}$ implement the derivative term $\Delta V_{\text {out }}^{*}[k]$.

The derivative term $\left(V_{\text {out }}[k-1]-V_{\text {out }}[k]\right)$ of the PD compensator is implemented through the bottom-plate sampling circuit illustrated in Fig. 7.11, which comprises a footer NMOS switch and a 560 fF sampling capacitor. When the sampling clock $M 2 G$ is high, the value of $V_{\text {out }}$ is stored across the sampling capacitor. When $M 2 G$ is low, the voltage of the negative comparator terminal becomes the difference term $\left(V_{\text {out }}[k]-V_{\text {out }}[k-1]\right)$. The difference is then compared to a virtual ground through the comparators $D I F F_{H}$ and $D I F F_{L}$ to produce the quantized difference term $\left[V_{\text {out }}[k-1]-V_{\text {out }}[k]\right]^{*}$. A replica sample-and-hold circuit is employed to sample $V_{\text {ref }, L}$ ( $\approx V_{\text {out }}$ ) in order to establish a virtual ground at the positive terminal of the comparators, so that errors due to charge-injection, clock feed-through, or comparator kickback noise are canceled via the inherent symmetry. The sampling switches M1 and M2 in Fig. 7.11 are enabled from INC and DEC instead of $C L K L$ and $C L K H$ so that the difference $\Delta V_{\text {out }}$ accumulates and becomes $\left(V_{\text {out }}[k]-V_{\text {out }}[k-m]\right)$. Therefore, if an erroneous $D I F F_{H}$ or $D I F F_{L}$ comparison results due to $k T / C$ noise, the difference increases until it overpowers this error, as shown in Fig. 7.12.

Figure 7.12: Difference accumulation to overpower $k T / C$ noise due to a small sampling capacitor size.
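To give a sense of scale for the $k T / C$ noise that the difference accumulation has to overcome, the sketch below evaluates the rms sampled noise $\sqrt{k T / C}$ for the 560 fF sampling capacitor. The temperature of 300 K is an assumption, as is the function name; the text does not state the operating temperature.

```python
import math

K_BOLTZMANN = 1.380649e-23  # J/K

def ktc_noise_rms(c_farads: float, temp_kelvin: float = 300.0) -> float:
    """RMS sampled thermal noise voltage, sqrt(k*T/C)."""
    return math.sqrt(K_BOLTZMANN * temp_kelvin / c_farads)

# 560 fF sampling capacitor from the text, assumed 300 K.
v_noise = ktc_noise_rms(560e-15)
print(f"kT/C noise: {v_noise * 1e6:.1f} uV rms")
```

At this assumed temperature the noise is on the order of 86 µV rms, so a single-cycle $V_{\text {out }}$ difference smaller than this can be misjudged, whereas the accumulated difference over $m$ cycles eventually exceeds it.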

When $V_{\text {out }}$ is within hysteresis, $O O H$ is reset low through the two SR latches in Fig. 7.11. Thus, the footer NMOS sampling switches are statically enabled through the OR gate that propagates the complement of $O O H$ to the sampling switch gate in Fig. 7.11. Once $V_{\text {out }}$ is outside hysteresis, the rising edge of the first clock pulse CLKL or CLKH might trigger the comparators $\operatorname{DIFF}_{L}$ and $D I F F_{H}$ when the difference $\left(V_{\text {out }}[k]-V_{\text {out }}[k-1]\right)$ is small, and hence the comparison result becomes stochastic. As a result, a clock pulse might be lost, which would increase the RLDO response-time. To prevent this, the first edge pass logic in Fig. 7.11 allows the first clock pulse $C L K L(C L K H)$ to pass directly to $I N C(D E C)$ irrespective of the $D I F F_{L}\left(D I F F_{H}\right)$ comparison result.

### 7.5.2 Regulation-Loop Interruption Logic

## Lower-Bound Correction Loop

During duty-control, the redundant LSB current can fail to bring the output voltage above $V_{r e f, L}$ if a large load-current increase $+\Delta I_{L}$ or a large input-voltage decrease $-\Delta V_{\text {in }}$ occurs. Therefore, lower-bound trigger logic is employed to restart the SAR search with the MSB,
$B(6)$, on and the remaining PMOS transistors off, by asserting $C N V_{D}$, created by AND-ing the redundant LSB gate signal, $P W M$, and $I N C^{2}$. In the worst-case, the lower-bound trigger logic takes two clock cycles to turn on the MSB switch, and hence the clock frequency is set by the required response-time as: $f_{C L K}>2 / T_{R}$.

To avoid artificial undershoots that may be introduced if a large $+\Delta I_{L}$ or $-\Delta V_{\text {in }}$ step change occurs in the middle of the SAR conversion steps after deciding the first few MSBs, a branch-prediction scheme ${ }^{3}$ is implemented. As in Fig. 7.13a, when two consecutive INC edges result without any $D E C$ assertion in-between during the SAR bit-cycling, the branch-prediction logic predicts that there is a disturbance, $+\Delta I_{L}$ or $-\Delta V_{i n}$, that is large in amplitude, and therefore, the remaining unchecked $l$ bits in the switch array may not be able to compensate for this disturbance. To verify this, the branch-prediction logic temporarily enables the remaining unchecked $l$ bits in the SAR register by asserting $O N$, as in Fig. 7.16a, where $B(2: 0)$ and PWM LSB $B(-1)$ are temporarily enabled. If $V_{\text {out }}$ exceeds $V_{\text {ref }, H}$ (i.e., $D E C=1$ ), then the present disturbance, $+\Delta I_{L}$ or $-\Delta V_{\text {in }}$, can be successfully accounted for through the remaining $l$ bits and the SAR operation can be continued as normal. This is similar to a branch-not-taken in pipeline hazards terminology. Otherwise, if $V_{\text {out }}$ is still slewing downwards below $V_{\text {ref }, L}$ despite the $l$ bits being turned on, then the SAR search should be restarted. This is equivalent to a branch-taken in pipeline design. Here, a third $I N C$ edge results although the $l$ bits are turned on, and hence the branch-prediction logic restarts the SAR search operation at $B[k]=7^{\prime} b 0111111$ ($B(6)$ is on) by asserting $C N V_{S A R}$, as in Fig. 7.16a.

Branch prediction is implemented using a 3-bit sequence detector, as in Fig. 7.13b. The Moore state-machine asserts its outputs, $O N=1$ and $C N V_{S A R}=1$, when it recognizes two and three $I N C$ edges in succession, respectively. The state-machine resets to its initial state when a $D E C$ edge results. As shown in the inset in Fig. 7.13b, a 2-input multiplexer is used to select the complement of either $C N V_{D}$ or $C N V_{S A R}$ as the SAR controller restart signal, $C N V$, based on the present active control loop, as determined by $E o C$.

[^5]

Figure 7.13: Proposed branch-prediction (a) flowchart and (b) state-diagram implementation. Inset: the SAR reset $C N V$ selection based on $E o C$ through an output multiplexer.

Figure 7.14: State-diagram implementation of the upper bound.
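The branch-prediction sequence detector described above can be sketched behaviorally as a small state machine that counts consecutive $I N C$ edges and is cleared by a $D E C$ edge. The class name, the string encoding of edges, keeping $O N$ asserted in the third state, and saturating at three edges until a $D E C$ edge arrives are all assumptions of this sketch; the text does not specify how the detector leaves the restart state.

```python
class BranchPredictor:
    """Moore-style detector: outputs depend only on the stored state."""

    def __init__(self) -> None:
        self.inc_run = 0  # consecutive INC edges seen so far (saturates at 3)

    def step(self, edge: str) -> tuple[bool, bool]:
        """Consume one 'INC' or 'DEC' edge and return (ON, CNV_SAR)."""
        if edge == "DEC":
            self.inc_run = 0                      # DEC returns to the initial state
        elif edge == "INC":
            self.inc_run = min(self.inc_run + 1, 3)
        on = self.inc_run >= 2                    # two INC edges: enable remaining bits
        cnv_sar = self.inc_run == 3               # three INC edges: restart the SAR search
        return on, cnv_sar

detector = BranchPredictor()
for edge in ("INC", "INC", "INC", "DEC"):
    print(edge, detector.step(edge))
```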

## Upper-Bound Correction Loop

During duty control, a large disturbance can make the output voltage exceed $V_{r e f, H}$, even though the redundant LSB is turned off. Similarly, in the middle of the $N$-step SAR conversion, a sudden $-\Delta I_{L}$ or $+\Delta V_{\text {in }}$ can result in increasing $V_{\text {out }}$ above $V_{\text {ref }, H}$, (i.e., overshoot), despite the turn-off of the last determined bit during SAR bit-cycling. This indicates that the already determined bits, e.g. $B(6: 3)$ in Fig. 7.16b, hold the wrong value due to the unaccounted $-\Delta I_{L}$ or $+\Delta V_{\text {in }}$ disturbance. Such interruptions during SAR control (or PWM) should be corrected by
disabling all the switch-array PMOS fingers immediately instead of wasting time investigating the remaining bits (e.g., $B(2: 0)$ in Fig. 7.16b), and increasing the $V_{\text {out }}$ overshoot. The proposed upper-bound trigger logic is implemented through a 2-bit sequence detector, as shown in Fig. 7.14. The sequence detector asserts its output, purge, when two $D E C$ edges occur in succession, as in Fig. 7.16b when turning off $B(3)$ is not enough to stop overshooting. On the other hand, the Moore machine resets to its initial state when an INC edge occurs. After turning off the whole PMOS switch array through the activated purge, the lower-bound logic restarts the SAR search operation when the output voltage, $V_{\text {out }}$, falls below $V_{\text {ref }, L}$ due to any $+\Delta I_{L}$ or $-\Delta V_{\text {in }}$ disturbance, as in Fig. 7.16b when the PWM LSB $B(-1)$ is not enough to supply the required $I_{L}$.
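The purge condition can be sketched behaviorally as a detector for two $D E C$ edges in succession that is cleared by any $I N C$ edge. The function name and the list-of-strings encoding of the edge stream are hypothetical, and the sketch assumes every element is either `"INC"` or `"DEC"`.

```python
def purge_signal(edges: list[str]) -> list[bool]:
    """Return the purge flag after each edge in the stream.

    purge is asserted once two DEC edges have occurred in succession;
    an INC edge returns the detector to its initial state.
    """
    dec_run = 0
    flags = []
    for edge in edges:
        dec_run = dec_run + 1 if edge == "DEC" else 0
        flags.append(dec_run >= 2)
    return flags

# A single DEC followed by INC does not purge; two DEC edges in a row do.
print(purge_signal(["DEC", "INC", "DEC", "DEC"]))
```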

## DC Correction Loop

PD compensation has the disadvantage that it can reduce the steady-state DC accuracy. During binary search, the PMOS DAC array is sequentially turned on, starting from the MSB, until $V_{\text {out }}$ reverses the direction of its slope and starts to increase towards the hysteresis window after turning on $B(i)$. Since the PMOS array current depends on the drop-out voltage, as $V_{\text {out }}$ increases, the current of the PMOS array, including the last turned-on finger $B(i)$, decreases and $V_{\text {out }}$ can get stuck below $V_{\text {ref }, L}$ without reaching the hysteresis window.

In order to avoid such a case, the next MSB switch $B(i-1)$ is turned on, as an extra safety step, although the present bit, $B(i)$, is enough to charge $V_{\text {out }}$ towards the hysteresis. When $V_{\text {out }}$ is less than $V_{r e f, L}$ and is increasing, the sampling clock propagates to the negative output of $\operatorname{DIFF}_{L}, D L-$, in Fig. 7.11, which is employed to set a DC correction flag CORR. As a result, the next MSB in the switch array, $B(i-1)$, is temporarily turned on to provide an extra half- $B(i)$ conductance in parallel to the present investigated bit, $B(i)$, to ensure the rapid increase of $V_{\text {out }}$ towards $V_{r e f, L}$ and hence avoid any possible condition of $V_{\text {out }}$ being stuck below $V_{r e f, L}$. Once $V_{\text {out }}$ exceeds $V_{r e f, L}, D N$ in Fig. 7.11 is reset low which clears the DC correction flag CORR, and hence, the temporarily turned-on bit is turned back off. For instance, in Fig. 7.16a, $B(3)$ is turned on immediately at the next rising clock edge, via CORR, until $V_{\text {out }}$ exceeds $V_{\text {ref }, L}$. A similar architecture can follow to avoid the case of $V_{\text {out }}$ being stuck above $V_{\text {ref }, H}$.

```
// initially B=7'b1111111 and i=6
BinarySearch(B(6:0),i)
    turn on B(i);           // perturb phase
    // observe phase: wait for the first edge from the PD compensator
    if (posedge DEC)
        turn off B(i);      // bit i overshoots the target
    if (posedge INC)
        keep B(i) on;       // bit i is needed
    if i=0
        return;             // all bits have been determined
    else
        return BinarySearch(B,i-1);
```

Figure 7.15: SAR backbone pseudo code.

### 7.5.3 Successive Approximation Controller

The RLDO's SAR logic follows a perturb-and-observe algorithm to determine the value of each bit in the PMOS switch array (Fig. 7.15). In the perturb phase, one bit in the switch array is turned on in order to test its output current value in comparison to the load current $I_{L}$. In the observe phase, the comparison result of the output voltage $V_{\text {out }}$ with the desired target $V_{\text {ref }}$ determines the value of the binary bit being tested through $D E C$ and $I N C$, coming from the PD controller to avoid oscillatory response as discussed. If $V_{\text {out }}$ exceeds $V_{\text {ref }}$ due to the present bit output current, $D E C$ is set high, and the present bit under test is turned off. Otherwise, if $V_{\text {out }}$ falls below $V_{\text {ref }}$, INC transitions from 0 to 1 , and the present bit being tested is left on and the conversion process proceeds to the next MSB until all the bits in the switch array have been determined. Figure 7.16 shows a simulated timing diagram and response of two representative load steps.
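The perturb-and-observe search can be illustrated with a static stand-in for the regulation loop. In the sketch below, a $D E C$ edge is modeled as "the enabled conductance exceeds the target" and an $I N C$ edge as the opposite; the dynamics of $V_{\text {out }}$, the PD compensator, and the correction loops are deliberately omitted. The function name, the active-high bit encoding (1 = finger on, unlike the PMOS gate drive), and the target values are assumptions.

```python
def sar_search(g_target: float, n_bits: int = 7, lsb: float = 1.0) -> int:
    """Binary search for the switch-array code closest below g_target.

    Static model: DEC fires when the enabled conductance exceeds g_target
    (V_out would rise above V_ref); otherwise INC fires.
    """
    code = 0
    for i in range(n_bits - 1, -1, -1):   # MSB first
        code |= 1 << i                    # perturb: turn on bit i
        if code * lsb > g_target:         # observe: DEC edge, bit i is too large
            code &= ~(1 << i)             # turn bit i back off
        # otherwise an INC edge leaves bit i on
    return code

# A target of 37.5 LSB resolves in 7 steps instead of a linear sweep.
print(sar_search(37.5))
```

The loop always finishes in `n_bits` iterations, compared with up to $2^{N}$ steps for a linear search over the same array.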


Figure 7.16: The RLDO simulated response to a periodic load swing between $40 \mu \mathrm{~A}$ and $200 \mu \mathrm{~A}$ within $200 \mathrm{ps}, V_{\text {in }}=0.5 \mathrm{~V}, V_{\text {ref }}=0.45 \mathrm{~V}, C_{\text {out }}=0.4 \mathrm{nF}$, and $f_{\text {CLK }}=100 \mathrm{MHz}$.

### 7.6 Experimental Verification

A 7-bit RLDO was implemented in $0.0023 \mathrm{~mm}^{2}$ in 65 nm CMOS, including the PMOS DAC array, SAR control logic, comparators, sample and hold circuits, an on-chip load, and 0.4 nF on-chip decoupling capacitance (Fig. 7.17).

### 7.6.1 Transient Measurements

## Mitigating the Speed-Stability Challenge in SAR LDOs

Taking SAR decisions at a faster rate than the rate of change of $V_{\text {out }}$ (i.e., $f_{L}$ ), in the hope of reducing the large staircase overshoots and undershoots of binary search, results in erroneous successive decisions during the SAR conversion and hence an unstable response. Figure 7.18 illustrates reliable SAR convergence despite employing a 100 MHz clock, whose period is far shorter than the output time constant, $411.64 \mu \mathrm{~s}$, thanks to the proposed PD compensation. Here, when $I_{L}$ is $40 \mu \mathrm{~A}$, the PWM controller tries to maintain $V_{\text {out }}$ at $V_{\text {ref }}$. On the other hand, when $I_{L}$ increases to $1.1 \mathrm{~mA}$, $V_{\text {out }}$ starts to discharge downwards until it falls below $V_{\text {ref }, L}$ (Fig. 7.18 inset). As a result, the lower-bound trigger logic restarts the coarse binary search with $B(6)$ on and the remaining PMOS transistors off, and hence $V_{\text {out }}$ rapidly increases until it exceeds $V_{\text {ref }, H}$. At that moment, $D E C$ is activated to turn off $B(6)$ and $V_{\text {out }}$ discharges until it reaches $V_{r e f, L}$, and then binary search continues in a similar manner. The measured duration between the turn-on of the successive PMOS DAC switches is approximately $50 \mu \mathrm{~s}, 60 \mu \mathrm{~s}, 80 \mu \mathrm{~s}$, and $100 \mu \mathrm{~s}$, as in Fig. 7.18. Therefore, although $V_{\text {out }}$ is sampled and observed at a high rate $(100 \mathrm{MHz})$ to reduce overshoots/undershoots, the SAR register is updated only at a rate close to the output time-constant, which enables stable operation. On the other hand, if the input 100 MHz clock instead directly ran the SAR register, $B(6)$ would be left on since $V_{\text {out }}$ does not exceed $V_{\text {ref }, H}$ in the following clock cycle (and rather takes $\sim 2 \mu \mathrm{~s}$ to reach $V_{\text {ref }, H}$). This would result in an erroneous $B(6)$ decision, and the SAR operation would never converge. As shown in Fig. 7.18, $V_{\text {out }}$ increases with binary-weighted slew rates across the successive iterations. When the SAR controller turns on $B(2)$, $V_{\text {out }}$ settles to within the hysteresis window, and hence $B(2)$ can provide the required $I_{L}\left(I_{L S B} \sim 275 \mu \mathrm{~A}\right)$. As shown, the RLDO reaches a complete halt-state at steady-state with minimal quiescent power, where the SAR and PWM controllers are gated off. To realize below $\pm 4 \%$ steady-state error, the PWM controller can be enabled once $V_{\text {out }}$ is within the hysteresis.

Figure 7.17: Die photo of fabricated RLDO.

Figure 7.18: Measured transient response of the RLDO to periodic square-wave load current variation with $V_{\text {IN }}=1 \mathrm{~V}, V_{\text {OUT }}=0.45 \mathrm{~V}$, and $C_{\text {out }}=1 \mu \mathrm{~F}$.

The correct SAR-loop conversion is further verified at all possible $I_{L}$ and $V_{i n}$ values via the measured load and line regulation plots in Fig. 7.21. Thanks to the PD compensation and DC correction loop, the LDO is able to make the correct successive decisions and reach the target $V_{\text {ref }}$ even while employing high speed clocks ranging between 10 MHz and 100 MHz , all without getting stuck outside the hysteresis.

## Exponential Improvement in FoM via Binary Search

Figure 7.19 shows the measured transient response of the RLDO for periodic (1 kHz) on-chip load changes between $40 \mu \mathrm{~A}$ and 1.1 mA with 1 ns rise/fall time. At $V_{\text{in}}=0.5 \mathrm{~V}$, the RLDO maintains less than 40 mV undershoot below $V_{\text{ref}}=0.45 \mathrm{~V}$ (not $V_{\text{ref}, L}$) for a quiescent current of $14 \mu \mathrm{~A}$ at a clock frequency of 100 MHz, thereby achieving a response time of 15.1 ns and a settling time of 100 ns. After the $I_{L}$ step increase, the SAR operation is restarted with $B(6)$ on and the remaining bits off. For a $\Delta V_{\text{drop-out}}$ of $\sim 90 \mathrm{mV}$, the MSB current is $\sim 845 \mu \mathrm{~A}$. Therefore, $B(6)$ slows down the $V_{\text{out}}$ discharge rate until the following clock edge, 10 ns later, when $B(5)$ is turned on, which reverses the direction of $V_{\text{out}}$ and brings it back to within the hysteresis, and hence $T_{R}$ is ideally 10 ns. In contrast, a modeled 65 nm shifter-based linear-search DLDO with the same fan-out capability provides $25 \times$ ($>2^{N} / N$) and $13.7 \times$ slower $T_{R}$ and $T_{S}$, respectively, while consuming $2^{7} \times$ the quiescent current. Thus, the RLDO achieves a FOM of 199.4 ps at $V_{\text{IN}}=0.5 \mathrm{~V}$, while the modeled 65 nm DLDO achieves 638 ns, illustrating a FOM improvement higher than $2^{2 N} / N$, as predicted from theory. The RLDO's measured overshoot is 62 mV, which demonstrates the effectiveness of the proposed upper-bound trigger logic in disabling the PMOS switches of $B(6: 5)$ immediately after the 1.1 mA-to-$40 \mu \mathrm{~A}$ change, rather than waiting to check the remaining bits $B(4: 0)$. Afterwards, $V_{\text{out}}$ is regulated by the duty controller to its steady-state value of 0.45 V.
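
The quoted response time and FOM can be cross-checked from the measured quantities. The following Python sketch assumes the definitions used in the comparison tables of this work, $T_{R}=C_{\text{out}} V_{\text{droop}} / \Delta I$ and FOM $=T_{R} \times I_{Q} / \Delta I$ (after Hazucha et al. [8]), with $C_{\text{out}}=0.4 \mathrm{nF}$ from Fig. 7.19; the function and variable names are illustrative only.

```python
def response_time(c_out, v_droop, delta_i):
    """Response time T_R = C_out * V_droop / delta_I, in seconds."""
    return c_out * v_droop / delta_i


def figure_of_merit(c_out, v_droop, delta_i, i_q):
    """FOM = T_R * I_Q / delta_I, in seconds (smaller is better)."""
    return response_time(c_out, v_droop, delta_i) * i_q / delta_i


# Measured RLDO values quoted in the text (V_in = 0.5 V, f_CLK = 100 MHz).
C_OUT = 0.4e-9             # output capacitance [F]
V_DROOP = 40e-3            # undershoot below V_ref [V]
DELTA_I = 1.1e-3 - 40e-6   # load step from 40 uA to 1.1 mA [A]
I_Q = 14e-6                # quiescent current [A]

t_r = response_time(C_OUT, V_DROOP, DELTA_I)
fom = figure_of_merit(C_OUT, V_DROOP, DELTA_I, I_Q)
print(f"T_R = {t_r * 1e9:.1f} ns, FOM = {fom * 1e12:.1f} ps")
```

Under these assumptions the sketch prints a response time of 15.1 ns and a FOM of 199.4 ps, in agreement with the values reported above.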

## PD Compensation Efficacy

Figure 7.19 also shows the efficacy of the proposed PD compensation scheme: stable load-step tests were performed even with a $1 \mu \mathrm{~F}$ external capacitor, which makes the output pole, $f_{L}$, $2500 \times$ more dominant than $f_{CLK}=100 \mathrm{MHz}$ and would render a baseline DLDO fully oscillatory. As shown in Fig. 7.19, the well-behaved first-order-like $V_{\text{out}}$ response confirms the elimination of the integrator pole from the loop dynamics, as predicted from the theory in Section 7.3, thereby realizing a single-pole system, and hence achieving stable operation irrespective of $C_{\text{out}}$, $I_{L}$, and conductance step $M$.

### 7.6.2 Steady-State Measurements

## Mitigating Limited DC Accuracy in SAR DLDOs

Figure 7.20 demonstrates the efficacy of the PWM controller. At a current $20 \times$ less than $I_{LSB}$ ($\approx 20 \mu \mathrm{~A}$ at $\Delta V_{\text{drop-out}}=0.2 \mathrm{~V}$), the PWM controller regulates $V_{\text{out}}$ with $<0.2 \% V_{\text{ref}}$ steady-state error (20.3-bit), all with $I_{L}$-independent peak-to-peak ripple of 20 mV. Figure 7.21 demonstrates that the proposed PWM control mitigates the challenge of limited current range with acceptable DC accuracy in binary-search DLDOs, and in fact achieves load regulation with less than $\pm 2 \% V_{\text{in}}$ steady-state error for a hysteresis window of $\pm 10 \mathrm{mV}$ across a $20,000 \times$ load dynamic range (14.3-bit: from 100 nA to 2 mA). Without PWM control, the dynamic range would be limited to $2.64 \times$ (1.4-bit). In this design, the PWM controller is only enabled after the SAR fully converges and not within the intermediate iterations, which trades accuracy for power. Therefore, for $I_{L}>I_{LSB}$, the PWM controller may or may not be enabled depending on whether the $V_{\text{out}}$ value at SAR halt resides within the target hysteresis. This enables $< \pm 2 \% V_{\text{ref}}$ and $<0.6 \% V_{\text{ref}}$ steady-state error when the PWM is disabled and enabled, respectively, as in Fig. 7.21a. For $I_{L}<I_{LSB}$, the PWM controller is always enabled, and hence the steady-state error is $< \pm 0.6 \%$, even when $V_{\text{ref}} \sim V_{\text{in}} / 2$, as shown in Fig. 7.21a. In a non-PWM-modulated 7-bit DLDO, the minimum supplied current cannot go below $I_{LSB}$ due to accuracy and ripple limitations. Fortunately, PWM control can effectively perform $V_{\text{out}}$ regulation below the single-finger current, enabling extension of the LDO effective resolution, $\log _{2}\left(I_{L, \max } / I_{L, \min }\right)$, from 7 b to 14.3 b, with a worst-case load regulation of $5.6 \mathrm{mV} / \mathrm{mA}$. As shown in Fig. 7.21b, a line regulation of $2.3 \mathrm{mV} / \mathrm{V}$ is achieved.

Figure 7.19: Measured transient response of the RLDO to a periodic square-wave load current variation with $V_{\text{IN}}=0.5 \mathrm{~V}$, $V_{\text{OUT}}=0.45 \mathrm{~V}$, and $C_{\text{out}}=0.4 \mathrm{nF}$ (top). When $C_{\text{out}}=1 \mu \mathrm{~F}$, the RLDO remains stable during periodic positive and negative load steps (bottom).

Figure 7.20: Measurement of output voltage ripple during PWM duty control.

Figure 7.21: (a) Line regulation measurement at a clock frequency of 100 MHz and a load current of 1 mA. (b) Load regulation measurement at $f_{CLK}=10 \mathrm{MHz}$ and $V_{\text{in}}=0.5 \mathrm{~V}$.
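
The effective-resolution figures quoted above follow directly from the definition $\log _{2}\left(I_{L, \max } / I_{L, \min }\right)$. The short Python sketch below evaluates it for the measured 100 nA to 2 mA range and for the $2.64 \times$ range quoted for operation without PWM control; the function name is illustrative only.

```python
import math


def effective_resolution_bits(i_max, i_min):
    """Effective resolution log2(I_L,max / I_L,min) of the load range, in bits."""
    return math.log2(i_max / i_min)


# Load range with PWM control, as measured: 100 nA to 2 mA.
with_pwm = effective_resolution_bits(2e-3, 100e-9)

# Without PWM control, the text quotes a dynamic range of 2.64x.
without_pwm = math.log2(2.64)

print(f"with PWM: {with_pwm:.1f} b, without PWM: {without_pwm:.1f} b")
```

This reproduces the 14.3-bit and 1.4-bit values given in the text.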

## Exponential Improvement in Quiescent Power

Figure 7.22 verifies the exponential ($1 / 2^{N}$) improvement in the quiescent power of the RLDO architecture over baseline linear-search designs. A peak current efficiency of $99.8 \%$ for 0.5 V-to-0.3 V conversion is achieved. More importantly, the RLDO achieves a current efficiency greater than $90 \%$ from $33.6 \mu \mathrm{~A}$ to 2 mA, a $60 \times$ load current dynamic range (Fig. 7.22a), and hence achieves the widest load range with efficiency higher than $90 \%$ as compared to $1.6 \times$ (even with fine-grain clock gating of the barrel-shifter [130]), $3.3 \times$, and $10 \times$ in the prior art (Fig. 7.23). Furthermore, as depicted in Fig. 7.22b, current efficiency greater than $84.4 \%$ is achieved across a $50 \times$ dynamic range at 0.5 V-to-0.45 V, exceeding a simulated barrel-based DLDO by $46.4 \%$.

Figure 7.22: Measured current efficiency $\eta$. (a) At $V_{\text{in}}=0.5 \mathrm{~V}$ and $V_{\text{out}}=0.3 \mathrm{~V}$ ($f_{CLK}=10 \mathrm{MHz}$), demonstrating efficiency higher than $90 \%$ from $33.6 \mu \mathrm{~A}$ to 2 mA. (b) At $V_{\text{in}}=0.5 \mathrm{~V}$ and $V_{\text{out}}=0.45 \mathrm{~V}$ ($f_{CLK}=10 \mathrm{MHz}$).

### 7.6.3 Performance Summary

From load regulation measurements and transient tests at various $f_{C L K}$ frequencies ranging from 1 MHz to 240 MHz , the RLDO with PD compensation and PWM regulation is stable irrespective of $I_{L}, f_{C L K}$, and $C_{\text {out }}$. In comparison to prior-art DLDOs in Fig. 7.23, the RLDO at 0.5 V achieves the fastest response $(3 \times)$ and settling $(11 \times)$ times, largest load dynamic range, smallest area $(9.13 \times)$, and best FOM $(13.8 \times)$.

### 7.7 Conclusion

This chapter has described a new recursive DLDO architecture that improved response time, settling time, active area, and quiescent power over conventional and augmented linear-search-based DLDOs via the use of a successive-approximation control scheme with PD compensation. Additionally, the RLDO's dynamic load range and steady-state error performance were enhanced through duty control of an additional LSB transistor.

Figure 7.23: Comparison of the RLDO with prior-art DLDOs.

### 7.8 Acknowledgements

This chapter is based on and mostly a reprint of the following publications:

- L.G. Salem, J. Warchall, and P.P. Mercier, "A successive-approximation low drop-out voltage regulator," IEEE Journal of Solid-State Circuits (JSSC), Jan. 2018, vol. 53, no. 1, pp. 35-49.
- L.G. Salem, J. Warchall, and P.P. Mercier, "A 100nA-2mA Successive-Approximation digital LDO with PD compensation and sub-LSB duty control achieving a 15.1 ns response-time at 0.5 V," 2017 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, Feb. 2017, pp. 340-341.

## Chapter 8

## A Digital Low-Dropout Voltage Regulator Employing Switched-Capacitor Resistance

### 8.1 Introduction

Modern DVFS-enabled SoCs require nimble supply regulators that rapidly respond to abrupt load changes and offer fine resolution (e.g., 12.5 mV in [131], 10 mV in [132]) over large voltage and current dynamic ranges. Switch-array digital LDOs (SA-DLDOs) are a potentially attractive regulation option due to their ability to operate with low input voltages and, in part, to their modular digital nature and scalability. SA-DLDOs employ $2^{N}$ unary- [133] or binary-weighted [134] PMOS arrays that are modulated through 1-bit or multi-bit ADCs to maintain the output voltage (Vout) at the desired level (Vref), as shown in Fig. 8.1 (top left). Unfortunately, while the array conductance in SA-DLDOs increases linearly with an equal step size (GLSB) as the code is increased, the output voltage step, vLSB, does not; in fact, vLSB is nonlinear: $\sim$ GL $\times$ Vout $\times$ GLSB. Thus, SA-DLDOs exhibit a nonlinear steady-state error, ess $=$ Vref $-$ Vout $\sim \pm$ GLSB/GL $\times$ Vdrop, as shown in Fig. 8.1 (bottom left), that deteriorates at large dropout voltages, Vdrop $=$ Vin $-$ Vout, and at small loads, GL. As a result, the required supply step of 10 mV (with $\pm 15 \%$ typical accuracy) to perform per-core DVFS over a typical $100 \times$ load dynamic range requires an impractical 16b PMOS array resolution. Even with limit-cycle oscillations, the load range that can achieve $\pm 1.5 \mathrm{mV}$ accuracy is provably limited to $2^{N-6.7}$ at Vref $=$ Vin/2 (Fig. 8.2, top left), which would still require a 14b array resolution that, even if it were feasible to build, would come with linearly (for binary search) or exponentially (for linear search) increased response time (TR), quiescent power (IQ), and area.

Figure 8.1: A conventional switch-array DLDO (left) and its accuracy problem (bottom left and right); proposed SCR-DLDO using a switched-capacitor resistance and its frequency-programmable equivalent conductance (right and bottom-right).

### 8.2 Switched-Capacitor Low-Dropout Voltage Regulator

To enable an industry-compliant digital replacement for analog LDOs, this work replaces the PMOS array in an SA-DLDO with a switched-capacitor resistance (SCR) that is created by switching the DLDO output capacitor, Co, as shown in Fig. 8.1 (top right). The SCR is then frequency-modulated through a hysteretic comparator to regulate Vout at Vref. Using two non-overlapping clocks, the top (H) and bottom (L) terminals of Co are alternately connected to (Vin, Vout) in $\Phi_{a}$ and (Vout, Vin) in $\Phi_{b}$ to charge and discharge Co by $2 \times$ Vdrop, which maximizes the charge delivered per unit capacitance. Importantly, a bilinear instead of a series SCR is utilized to ensure that $100 \%$ of Co is always connected to Vout despite switch commutation. As shown in Fig. 8.1, the charge transferred, and hence the SCR equivalent conductance, GSCR, increases linearly with fsw in the slow-switching limit (SSL) region, $\mathrm{G}_{SCR, SSL}=4 \mathrm{Co} \times \mathrm{fsw}$, until saturating in the fast-switching limit (FSL) region, $\mathrm{G}_{SCR, FSL}$, at its maximum value of $\mathrm{G}=1 /(2 \mathrm{Ron})$ when Tsw is near the SC time constant, $\tau=\mathrm{Co} / \mathrm{G}$, where Ron is the switch equivalent on-resistance. Since GSCR can be made arbitrarily small, the SCR-DLDO can, unlike SA-DLDOs, regulate down to arbitrarily low IL.
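
To make the frequency-programmable conductance concrete, the following Python sketch evaluates the SSL expression $\mathrm{G}_{SCR, SSL}=4 \mathrm{Co} \times \mathrm{fsw}$ and the switching frequency needed to carry a given load, IL $=$ GSCR $\times$ Vdrop. It is a minimal sketch under stated assumptions: the `min()` clipping at $\mathrm{G}=1 /(2 \mathrm{Ron})$ is only a piecewise asymptote, not the exact GSCR expression of Fig. 8.1, and the value of Ron is hypothetical. Co is taken as the 200 pF used in the measured design.

```python
def g_scr_ssl(c_o, f_sw):
    """Slow-switching-limit conductance G_SCR,SSL = 4 * Co * fsw, in siemens."""
    return 4.0 * c_o * f_sw


def g_scr_asymptotic(c_o, f_sw, r_on):
    """Assumed piecewise model: SSL line clipped at the FSL ceiling 1/(2*Ron)."""
    return min(g_scr_ssl(c_o, f_sw), 1.0 / (2.0 * r_on))


def f_sw_for_load(i_l, v_drop, c_o):
    """SSL switching frequency that carries i_l: fsw = IL / (4 * Co * Vdrop)."""
    return i_l / (4.0 * c_o * v_drop)


C_O = 200e-12   # output capacitor Co [F], from the measured design
R_ON = 10.0     # switch on-resistance [ohm], hypothetical value
V_DROP = 0.6    # dropout for Vin = 0.9 V, Vout = 0.3 V [V]

for i_l in (10e-6, 3e-3):
    f_sw = f_sw_for_load(i_l, V_DROP, C_O)
    g = g_scr_asymptotic(C_O, f_sw, R_ON)
    print(f"IL = {i_l * 1e6:7.1f} uA -> fsw = {f_sw / 1e6:.4f} MHz, G = {g * 1e6:.2f} uS")
```

In this model the required fsw falls in direct proportion to the load current, which is why GSCR, and hence the regulated IL, has no lower bound set by a minimum array step.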

Interestingly, placing the SCR connected to the load in feedback with a hysteretic comparator establishes a Vref-controlled relaxation oscillator, which accumulates the difference $\Delta \mathrm{V}=$ Vref $-$ Vout to determine the oscillator period, Tsw ($=1 / \mathrm{fsw}$), that realizes Vout $=$ Vref. While both SA-DLDOs and the proposed SCR-DLDO follow a search (i.e., integration) control law of the control variable, the hysteretic oscillator is nonlinear and hence abruptly finds the target fsw (i.e., with $\sim 0$ acquisition time), enabling a response time that is limited only by comparator latency. Compared to an RLDO [134] that achieves TR $=\mathrm{N} \times$ TCLK (Fig. 8.2, bottom right), the SCR-DLDO achieves TR $<$ TCLK for the same $\Delta$IL and Co, where Vdrop is typically larger than Vdroop and hence the charge storage capacity of Co is amplified by the actively produced voltage swing, $2 \times$ Vdrop. This, along with the efficient SCR-DLDO architecture, enables provably $2 N$ and $2^{2 N+1}$ better FOM over binary- [134] and linear-search [133] SA-DLDOs, respectively.
Using a clocked comparator, Tsw can only take on integer multiples of the comparator sampling clock (i.e., Tsw $=\mathrm{k} \times$ TCLK). In this case, the minimum GSCR step within the SSL is $\mathrm{G}_{SCR, SSL} \times \mathrm{f}[\mathrm{k}] / \mathrm{fCLK}$. Thus, the achievable Vout resolution is vLSB $=$ Vout $\times \mathrm{f}[\mathrm{k}] / \mathrm{fCLK}$, and hence ess, unlike in SA-DLDOs, actually improves with smaller IL or larger Vdrop due to a decreasing $\mathrm{f}[\mathrm{k}]$. In the FSL region, the SCR GLSB and vLSB values can be found from the GSCR expression in Fig. 8.1 (bottom right). Throughout the SSL and FSL regions, ess is below 1 mV (ENOB $>$ 10b) across the entire Vout and IL ranges when employing an fCLK of only $4 / \tau$ (Fig. 8.2, top right). This outperforms the accuracy of a 10b SA-DLDO, which not only suffers from ess $=138 \mathrm{mV}$ (ENOB $=2.9$b), but also from a limited 0.65-to-0.95 V Vout range at $100 \times \mathrm{R}_{L, \min }$. Above all, increasing the SCR resolution via a finer TCLK actually improves TR, unlike in SA-DLDOs.
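
The SSL resolution relation can be illustrated numerically. Since $\mathrm{f}[\mathrm{k}]=\mathrm{fCLK} / \mathrm{k}$, the expression vLSB $=$ Vout $\times \mathrm{f}[\mathrm{k}] / \mathrm{fCLK}$ reduces to Vout$/\mathrm{k}$. The Python sketch below is a rough SSL-only estimate: it assumes the period count k is the nearest integer to fCLK/fsw, takes fsw from IL $=4 \mathrm{Co} \times$ Vdrop $\times$ fsw, and uses example operating points (Vin $=0.9$ V, Vout $=0.3$ V, Co $=200$ pF, fCLK $=1$ GHz); all names are illustrative.

```python
def period_steps(f_sw_target, f_clk):
    """Integer k >= 1 such that Tsw = k * TCLK best approximates 1 / f_sw_target."""
    return max(1, round(f_clk / f_sw_target))


def v_lsb(v_out, k):
    """SSL output step vLSB = Vout * f[k] / fCLK with f[k] = fCLK / k, i.e. Vout / k."""
    return v_out / k


C_O = 200e-12             # output capacitor Co [F]
F_CLK = 1e9               # comparator sampling clock [Hz]
V_IN, V_OUT = 0.9, 0.3    # example operating point [V]
V_DROP = V_IN - V_OUT

for i_l in (10e-6, 100e-6, 1e-3):
    f_sw = i_l / (4.0 * C_O * V_DROP)   # SSL switching frequency for this load
    k = period_steps(f_sw, F_CLK)
    print(f"IL = {i_l * 1e6:6.0f} uA: k = {k:5d}, vLSB = {v_lsb(V_OUT, k) * 1e6:8.2f} uV")
```

In this estimate, each tenfold reduction in IL increases k tenfold and shrinks vLSB by the same factor, which is the sense in which accuracy improves at light load.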

It can be shown that, for the same Co and for the maximum achievable conductance in an LDO (G), fsw of the SCR-DLDO is at least 20 times lower than the required fCLK in an SA-DLDO, making the switches' gate-drive losses insignificant in the overall current efficiency, $\eta$. Unlike in SA-DLDOs, fsw scales with load current in proportion to the time constant of the load, (CL+Co)/GL, which exponentially scales the control overhead and the gate-drive losses as shown in Fig. 8.2 (bottom left), vastly improving efficiency at light loads. Since the SCR-DLDO accuracy improves exponentially with a lower load current, the comparator sampling clock, and thus its IQ, can be linearly scaled with IL by providing fCLK directly from the frequency-scaled clock of the underlying load, as shown in the measurements in Fig. 8.2 (bottom left).

### 8.3 Circuit Implementation

Unlike SA-DLDOs, which enter limit-cycle oscillations and suffer from periodic $20-70 \mathrm{mV}$ ripple [133], output ripple, $\Delta$Vpp, in the SCR-DLDO can be reduced by a factor M by time-interleaving M unit SCR cells, as shown in Fig. 8.3 (top right) for $\mathrm{M}=4$. Since $\Delta$Vpp nominally increases linearly with Vdrop as described by relation (1) in Fig. 8.3 (top left), the amount of capacitance that takes part in charge transfer can be scaled by a similar factor to the Vdrop increase, P, thereby canceling each other per relation (2), without affecting the load handling capability, $4 \mathrm{Co} \times$ Vdrop $\times$ fsw. The proposed binary-ripple-control (BRC) scheme, shown in Fig. 8.3 (bottom right), divides the capacitance and conductance of each of the 4 interleaved phases into 5 binary-weighted banks that are enabled by EN[4:0] and a redundant always-on LSB bank, where EN[4:0] can be provided from an existing battery state-of-charge monitoring circuit or the switching regulator supplying Vin. The SCR-DLDO 1x unit cell, non-overlap circuit, and comparator schematics are shown in Fig. 8.3 (bottom).

Figure 8.2: Accuracy advantage of a 1V DLDO using SCR to perform D/A conversion in the time-domain versus SA-DLDOs that achieve poor-accuracy conversion in the current-domain (top); overhead current reduction due to fCLK scaling (bottom left); fast response time advantage of the proposed SCR-DLDO (bottom right).

Figure 8.3: SCR ripple mitigation via the proposed binary ripple control scheme (top-left); top-level block diagram of the implemented SCR-DLDO (top-right); cell partitioning to implement binary ripple control, along with the schematics of the SCR 1x unit-cell, non-overlap circuitry, and comparator (bottom).
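
The bank partitioning used by BRC can be sketched as follows. This Python example is hypothetical in several respects: it assumes the five banks are weighted 1, 2, 4, 8, and 16 units relative to the always-on LSB bank (32 units per phase in total), and it assumes a simple mapping in which the active capacitance is scaled down by the factor P by which Vdrop exceeds a reference dropout `v_drop_ref`. Neither the reference dropout nor the mapping function is taken from the implemented design.

```python
BANK_WEIGHTS = (1, 2, 4, 8, 16)   # assumed unit weights of the banks gated by EN[0]..EN[4]
ALWAYS_ON_UNITS = 1               # redundant always-on LSB bank
TOTAL_UNITS = ALWAYS_ON_UNITS + sum(BANK_WEIGHTS)   # 32 units per interleaved phase


def active_units(en_code):
    """Capacitance units taking part in charge transfer for a 5-bit EN code."""
    if not 0 <= en_code <= 31:
        raise ValueError("EN[4:0] must be in the range 0..31")
    enabled = sum(w for bit, w in enumerate(BANK_WEIGHTS) if (en_code >> bit) & 1)
    return ALWAYS_ON_UNITS + enabled


def en_code_for_dropout(v_drop, v_drop_ref):
    """Hypothetical mapping: shrink the active capacitance by P = v_drop / v_drop_ref."""
    p = max(1.0, v_drop / v_drop_ref)
    units = min(TOTAL_UNITS, max(1, round(TOTAL_UNITS / p)))
    return units - ALWAYS_ON_UNITS


for v_drop in (0.1, 0.2, 0.6):
    code = en_code_for_dropout(v_drop, v_drop_ref=0.1)
    print(f"Vdrop = {v_drop:.1f} V -> EN[4:0] = {code:05b}, active units = {active_units(code)}")
```

With the always-on bank, every EN code leaves at least one unit switching, so the regulator never loses its charge-transfer path while the active capacitance is being scaled.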

### 8.4 Measurement Results

The proposed SCR-DLDO is fully integrated in a $0.00137 \mathrm{~mm}^{2}$ core area in 65 nm CMOS with $\mathrm{Co}=200 \mathrm{pF}$ and $\mathrm{CL}=165 \mathrm{pF}$, which mimics the inherent capacitance of a 3 mA digital load with 5\% activity. Steady-state measurement results in Fig. 8.4 demonstrate the efficacy of the SCR-DLDO in realizing high accuracy: a steady-state error of at most $\pm 1.55 \mathrm{mV}$ is measured across all desired Vout values between $0.3-0.8 \mathrm{~V}$ over Vin corners of 0.5 V and 0.9 V, all over a $10 \mu \mathrm{~A}$-3 mA ($300 \times$) dynamic range (Fig. 8.4, left and top right). For Vin $=0.9$ V, fCLK is set to 1 GHz to enable a Tsw LSB of $\tau / 4$, which would theoretically achieve $< \pm 1 \mathrm{mV}$ accuracy as in Fig. 8.2. Due to the comparator's frequency-dependent S-shaped offset (Fig. 8.4, left), ess grows to $\pm 1.55 \mathrm{mV}$, which still enables up to a $174.7 \times$ accuracy improvement over a simulated 10-bit shifter-based DLDO in Fig. 8.2 (top left). Since ess improves with lower load current, fCLK is safely linearly scaled from 1 GHz down to 4 MHz at $10 \mu \mathrm{~A}$ to scale the comparator IQ, which improves $\eta$ and IL-range by $50.3 \%$ and $10 \times$, respectively (Fig. 8.4, bottom right).

Compared to the RLDO in [134], at Vin $=0.5 \mathrm{~V}$ the SCR-DLDO reduces the measured error by $2.4-4.3 \times$ for Vout ranging between $0.3-0.45 \mathrm{~V}$ (Fig. 8.4, top right), demonstrating the accuracy advantage of SCR- over SA-DLDOs. The SCR-DLDO achieves a peak $\eta$ of $99.3 \%$, and at Vout $=0.3 \mathrm{~V}$, operates from $1.5 \mu \mathrm{~A}$ to $1.75 \mathrm{~mA}$ with $\eta>70 \%$ (a $1,167 \times$ dynamic range), exceeding [134] by $5 \times$ and improving light-load efficiency by $37 \%$ (Fig. 8.4, bottom right). With IQ $=48.4 \mu \mathrm{~A}$ (fCLK $=1 \mathrm{GHz}$), the SCR-DLDO achieves a measured TR $=2.48 \mathrm{~ns}$ with Vdroop $=20.5 \mathrm{mV}$ in response to periodic $50 \mu \mathrm{~A}$-3.3 mA on-chip load swings occurring within 200 ps. Thus, the achieved FOM is 36.9 ps, a $5.4 \times$ improvement over prior art (Fig. 8.6). BRC operation is demonstrated to reduce ripple from 161.3 mV to 21.7 mV at the worst-case voltage corner (Vin $=0.9 \mathrm{~V}$ to Vout $=0.3 \mathrm{~V}$) across a $300 \times$ IL range (Fig. 8.5, bottom left and right) and at the worst-case current corner (Vin $=0.9$ V with IL $=10 \mu \mathrm{~A}$) for Vout between $0.3-0.8 \mathrm{~V}$ (bottom middle). A die photo is shown in Fig. 8.7.
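
As a consistency check, the reported response time and FOM can be worked out from the measured quantities. The derivation below assumes the definitions listed with Fig. 8.6, a total output capacitance of Co + CL $=365 \mathrm{pF}$, and a load step of $\Delta I=3.3 \mathrm{~mA}-50 \mu \mathrm{A}=3.25 \mathrm{~mA}$:

$$
T_{R}=\frac{(\mathrm{Co}+\mathrm{CL}) \times V_{\text{droop}}}{\Delta I}=\frac{365 \mathrm{pF} \times 20.5 \mathrm{mV}}{3.25 \mathrm{~mA}} \approx 2.30 \mathrm{~ns}
$$

$$
\mathrm{FOM}=T_{R} \times \frac{I_{Q}}{\Delta I}=2.48 \mathrm{~ns} \times \frac{48.4 \mu \mathrm{A}}{3.25 \mathrm{~mA}} \approx 36.9 \mathrm{ps}
$$

Using the calculated $T_{R}$ of 2.30 ns instead of the measured 2.48 ns gives approximately 34.3 ps, which matches the two FOM values listed in Fig. 8.6.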

### 8.5 Acknowledgements

This chapter is based on and mostly a reprint of the following publication:
L.G. Salem and P.P. Mercier, "A sub-1.55mV accuracy 36.9ps FOM digital low-dropout regulator employing switched-capacitor resistance," 2018 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, Feb. 2018.

Figure 8.4: SCR-DLDO output voltage and current efficiency at corners Vin $=0.9 \mathrm{~V}$ and Vin $=0.5 \mathrm{~V}$, demonstrating high accuracy and efficiency. The measured SCR-DLDO output and efficiency are compared to the measured results of a recursive DLDO.


Figure 8.5: Measured dynamic response of the SCR-DLDO to an on-chip periodic load-step demonstrating 2.48 ns response time at 36.9 ps FOM (top). Illustration of the efficacy of the proposed binary ripple control scheme in mitigating the SCR ripple at light loads and small Vout values (bottom).

| Design | Huang, ISSCC'17 | Tsou, ISSCC'17 | Kim, ISSCC'17 | Salem, ISSCC'17 | This Work |
| :---: | :---: | :---: | :---: | :---: | :---: |
| Process | 65 nm | 40 nm | 65 nm | 65 nm | 65 nm |
| Active area [$\mathrm{mm}^{2}$] | 0.03 | 0.193 | 0.03 | 0.0023 | 0.00137 |
| $V_{\text{in}}{}^{3}$ [V] | 0.6 | 0.6-1.1 | 0.45-1 | 0.5-1 | 0.5-0.9 |
| $V_{\text{out}}{}^{3}$ [V] | 0.5 | 0.5-1.0 | 0.4-0.95 | 0.3-0.45 | 0.3-0.8 |
| Loop actuator | 9b PMOS array | 7b PMOS array | 4x 7b PMOS array | 7b PMOS array | SC resistance |
| Control loop | Barrel shifter + analog assisted | PID | Event-driven multi-bit ADC | SAR/PD/PWM | Hysteretic relaxation oscillator |
| Limit-cycle capability for accuracy improvement | No limit-cycle | No limit-cycle | No limit-cycle | No limit-cycle | Not applicable |
| Load range | 2 mA-12 mA (6x) | 15 mA-210 mA (14x) | $14 \mu \mathrm{A}$-3.36 mA (~100x*) | 100 nA-2 mA (20,000x) | $10 \mu \mathrm{A}$-3 mA (300x) |
| Load range with $\eta>90 \%$ | N.R. | N.R. | ~10x | $33.6 \mu \mathrm{A}$-2 mA (60x) | $10 \mu \mathrm{A}$-1.75 mA (175x) |
| $C_{L}$ [nF] / Total C | 0/0.1 | 20/20 | 0.1/0.1 | 0.4/0.4 | 0.165/0.365 |
| Quiescent $I_{Q}$ [$\mu \mathrm{A}$] during load transient test | N.R. | N.R. | 258 | 14 | 48.4 |
| $V_{\text{droop}}$ @ load step size for load transient test | 105 mV @ 10 mA | 38 mV @ 200 mA | 34 mV @ 1.44 mA | 40 mV @ 1.06 mA | 20.5 mV @ 3.25 mA |
| Response time$^{1}$ $T_{R}$ [ns] from relation: $C_{\text{out}} V_{\text{droop}} / \Delta I$ | $1.05^{\dagger}$ | $1000^{\dagger\dagger}$ | 2.36 | 15.1 | 2.3 (2.48 measured) |
| FOM$^{2}$ for load transient test [ns] | - | $0.493^{\dagger\dagger}$ | 0.423 | 0.199 | 0.0343 (0.0369) |
| Load step rise/fall time for load transient test$^{2}$ | 1 ns$^{\dagger}$ | 1000 | N.R. | <1 ns | <200 ps |
| Peak current efficiency $\eta$ [\%] | N.R. | N.R. | 99.2 | 99.8 | 99.3 |
| Sampling clock range | 10 MHz | N.R. | Not applicable | 1 MHz-240 MHz (240x) | 100 kHz-1.55 GHz (15,500x) |
| Steady-state error [mV] | N.R. | $<150^{\dagger\dagger\dagger}$ | <15 | <5.2 | <1.55 |
| DC accuracy: peak error / $V_{\text{in}}$ | N.R. | $\pm 13.6 \%^{\dagger\dagger\dagger}$ | $\pm 1.5 \%$ | 1.04\% | $\pm 0.17 \%$ |
| Load regulation: worst peak-to-peak error / $\Delta I$ [mV/mA] | N.R. | 0.8 | <15 | <11.3 across range | <1.03 across range |

- N.R. = Not Reported.
- $^{1}$ Load rise/fall time should be $<T_{R} / 10$ for a valid unit-step FOM measurement.
- $^{2}$ FOM $=C_{L} V_{\text{droop}} /\left(I_{\max }-I_{\min }\right) \times I_{Q} /\left(I_{\max }-I_{\min }\right)$, P. Hazucha et al., JSSC'05.
- $^{3}$ Measured ranges are only depicted.
- $^{\dagger}$ $I_{Q}$ consumption during transient test was not reported. Also, the reported $T_{R}$ is approximately the load rise/fall time, and hence the reported FOM is for the unit-ramp response and not the unit-step response.
- $^{\dagger\dagger}$ Observed from transient measurement.
- $^{\dagger\dagger\dagger}$ Best-case theoretical value across the reported 15-150 mA (10x) range.
- \* For the same V,

Figure 8.6: Comparison of the proposed SCR-DLDO with state-of-the-art switch-array DLDOs illustrating the smallest area, best FOM, and highest accuracy, enabling a realistic industry-compliant digital replacement to analog LDOs for 3.1 mV step DVS and adaptive voltage scaling applications.


Figure 8.7: Micrograph of the fabricated SCR-DLDO chip.

## Bibliography

[1] H. P. Le, J. Crossley, S. R. Sanders, and E. Alon, "A sub-ns response fully integrated battery-connected switched-capacitor voltage regulator delivering $0.19 \mathrm{~W} / \mathrm{mm}^{2}$ at $73 \%$ efficiency," Digest of Technical Papers - IEEE International Solid-State Circuits Conference, vol. 56, pp. 372-373, 2013.
[2] G. Rincon-Mora and P. Allen, "A low-voltage, low quiescent current, low drop-out regulator," IEEE Journal of Solid-State Circuits, vol. 33, no. 1, pp. 36-44, 1998.
[3] P. Mok, "A capacitor-free cmos low-dropout regulator with damping-factor-control frequency compensation," IEEE Journal of Solid-State Circuits, vol. 38, pp. 1691-1702, oct 2003.
[4] Y. K. Ramadass and A. P. Chandrakasan, "Minimum Energy Tracking Loop With Embedded DC-DC Converter Enabling Ultra-Low-Voltage Operation Down to 250 mV in 65 nm CMOS," IEEE Journal of Solid-State Circuits, vol. 43, pp. 256-265, jan 2008.
[5] Y. K. Ramadass, A. A. Fayed, and A. P. Chandrakasan, "A Fully-Integrated Switched-Capacitor Step-Down DC-DC Converter With Digital Capacitance Modulation in 45 nm CMOS," IEEE Journal of Solid-State Circuits, vol. 45, pp. 2557-2565, dec 2010.
[6] B. H. Calhoun and A. P. Chandrakasan, "Ultra-dynamic Voltage scaling (UDVS) using sub-threshold operation and local Voltage dithering," IEEE Journal of Solid-State Circuits, vol. 41, pp. 238-245, Jan 2006.
[7] S. Bandyopadhyay, Y. K. Ramadass, and A. P. Chandrakasan, "A $20 \mu \mathrm{A}$ to 100 mA DC-DC Converter With 2.8-4.2 V Battery Supply for Portable Applications in 45 nm CMOS," IEEE Journal of Solid-State Circuits, vol. 46, pp. 2807-2820, dec 2011.
[8] P. Hazucha, T. Karnik, B. Bloechel, C. Parsons, D. Finan, and S. Borkar, "Area-efficient linear regulator with ultra-fast load regulation," IEEE Journal of Solid-State Circuits, vol. 40, pp. 933-940, apr 2005.
[9] G. Schrom, P. Hazucha, F. Paillet, D. J. Rennie, S. T. Moon, D. S. Gardner, T. Karnik, P. Sun, T. T. Nguyen, M. J. Hill, K. Radhakrishnan, and T. Memioglu, "A 100MHz Eight-Phase Buck Converter Delivering 12A in 25 $\mathrm{mm}^{2}$ Using Air-Core Inductors," in APEC 07 - Twenty-Second Annual IEEE Applied Power Electronics Conference and Exposition, pp. 727-730, IEEE, feb 2007.
[10] P. Li, L. Xue, P. Hazucha, T. Karnik, and R. Bashirullah, "A Delay-Locked Loop Synchronization Scheme for High-Frequency Multiphase Hysteretic DC-DC Converters," IEEE Journal of Solid-State Circuits, vol. 44, pp. 3131-3145, nov 2009.
[11] N. Sturcken, M. Petracca, S. Warren, P. Mantovani, L. P. Carloni, A. V. Peterchev, and K. L. Shepard, "A Switched-Inductor Integrated Voltage Regulator With Nonlinear Feedback and Network-on-Chip Load in 45 nm SOI," IEEE Journal of Solid-State Circuits, vol. 47, pp. 1935-1945, aug 2012.
[12] C. Huang and P. K. T. Mok, "A $100 \mathrm{MHz} 82.4 \%$ Efficiency Package-Bondwire Based Four-Phase Fully-Integrated Buck Converter With Flying Capacitor for Area Reduction," IEEE Journal of Solid-State Circuits, vol. 48, pp. 2977-2988, dec 2013.
[13] H.-P. Le, S. R. Sanders, and E. Alon, "Design Techniques for Fully Integrated Switched-Capacitor DC-DC Converters," IEEE Journal of Solid-State Circuits, vol. 46, pp. 2120-2131, sep 2011.
[14] T. M. Van Breussegem and M. S. J. Steyaert, "Monolithic Capacitive DC-DC Converter With Single Boundary-Multiphase Control and Voltage Domain Stacking in 90 nm CMOS," IEEE Journal of Solid-State Circuits, vol. 46, pp. 1715-1727, jul 2011.
[15] R. Jain, B. M. Geuskens, S. T. Kim, M. M. Khellah, J. Kulkarni, J. W. Tschanz, and V. De, "A 0.45-1 V Fully-Integrated Distributed Switched Capacitor DC-DC Converter With High Density MIM Capacitor in 22 nm Tri-Gate CMOS," IEEE Journal of Solid-State Circuits, vol. 49, pp. 917-927, apr 2014.
[16] D. Somasekhar, B. Srinivasan, G. Pandya, F. Hamzaoglu, M. Khellah, T. Karnik, and K. Zhang, "Multi-Phase 1 GHz Voltage Doubler Charge Pump in 32 nm Logic Process," IEEE Journal of Solid-State Circuits, vol. 45, pp. 751-758, apr 2010.
[17] L. Chang, R. K. Montoye, B. L. Ji, A. J. Weger, K. G. Stawiasz, and R. H. Dennard, "A fully-integrated switched-capacitor 2:1 voltage converter with regulation capability and $90 \%$ efficiency at $2.3 \mathrm{~A} / \mathrm{mm}^{2}$," in 2010 Symposium on VLSI Circuits, pp. 55-56, IEEE, jun 2010.
[18] H.-P. Le, J. Crossley, S. R. Sanders, and E. Alon, "A sub-ns response fully integrated battery-connected switched-capacitor voltage regulator delivering $0.19 \mathrm{~W} / \mathrm{mm}^{2}$ at $73 \%$ efficiency," in 2013 IEEE International Solid-State Circuits Conference Digest of Technical Papers, pp. 372-373, IEEE, feb 2013.
[19] T. M. Andersen, F. Krismer, J. W. Kolar, T. Toifl, C. Menolfi, L. Kull, T. Morf, M. Kossel, M. Brandli, P. Buchmann, and P. A. Francese, "A $4.6 \mathrm{~W} / \mathrm{mm}^{2}$ power density $86 \%$ efficiency on-chip switched capacitor DC-DC converter in 32 nm SOI CMOS," in 2013 Twenty-Eighth Annual IEEE Applied Power Electronics Conference and Exposition (APEC), pp. 692-699, IEEE, mar 2013.
[20] T. M. Andersen, F. Krismer, J. W. Kolar, T. Toifl, C. Menolfi, L. Kull, T. Morf, M. Kossel, M. Brandli, P. Buchmann, and P. A. Francese, "4.7 A sub-ns response on-chip switched-capacitor DC-DC voltage regulator delivering $3.7 \mathrm{~W} / \mathrm{mm}^{2}$ at $90 \%$ efficiency using deep-trench capacitors in 32nm SOI CMOS," in 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 90-91, IEEE, feb 2014.
[21] M. D. Seeman and S. R. Sanders, "Analysis and Optimization of Switched-Capacitor DC-DC Converters," IEEE Transactions on Power Electronics, vol. 23, pp. 841-851, mar 2008.
[22] D. El-Damak, S. Bandyopadhyay, and A. P. Chandrakasan, "A 93\% efficiency reconfigurable switched-capacitor DC-DC converter using on-chip ferroelectric capacitors," in 2013 IEEE International Solid-State Circuits Conference Digest of Technical Papers, pp. 374-375, IEEE, feb 2013.
[23] Y. K. Ramadass and A. P. Chandrakasan, "Voltage Scalable Switched Capacitor DC-DC Converter for Ultra-Low-Power On-Chip Applications," in 2007 IEEE Power Electronics Specialists Conference, pp. 2353-2359, IEEE, 2007.
[24] T. V. Breussegem and M. Steyaert, "A $82 \%$ efficiency $0.5 \%$ ripple 16-phase fully integrated capacitive voltage doubler," 2009.
[25] L. G. Salem and P. P. Mercier, "4.6 An 85\%-efficiency fully integrated 15-ratio recursive switched-capacitor DC-DC converter with 0.1-to-2.2V output voltage range," in 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 88-89, IEEE, feb 2014.
[26] M. Evzelman and S. Ben-Yaakov, "Average-Current-Based Conduction Losses Model of Switched Capacitor Converters," IEEE Transactions on Power Electronics, vol. 28, pp. 3341-3352, jul 2013.
[27] M. D. Seeman, V. W. Ng, H.-P. Le, M. John, E. Alon, and S. R. Sanders, "A comparative analysis of Switched-Capacitor and inductor-based DC-DC conversion technologies," in 2010 IEEE 12th Workshop on Control and Modeling for Power Electronics (COMPEL), pp. 1-7, IEEE, jun 2010.
[28] S. R. Sanders, E. Alon, H.-P. Le, M. D. Seeman, M. John, and V. W. Ng, "The Road to Fully Integrated DC-DC Conversion via the Switched-Capacitor Approach," IEEE Transactions on Power Electronics, vol. 28, pp. 4146-4155, sep 2013.
[29] J. Brugler, "Theoretical performance of voltage multiplier circuits," IEEE Journal of Solid-State Circuits, vol. 6, pp. 132-135, jun 1971.
[30] J. Dickson, "On-chip high-voltage generation in MNOS integrated circuits using an improved voltage multiplier technique," IEEE Journal of Solid-State Circuits, vol. 11, pp. 374-378, jun 1976.
[31] M. Makowski and D. Maksimovic, "Performance limits of switched-capacitor DC-DC converters," in Proceedings of PESC '95-Power Electronics Specialist Conference, vol. 2, pp. 1215-1221, IEEE, 1995.
[32] S. Bang, A. Wang, B. Giridhar, D. Blaauw, and D. Sylvester, "A fully integrated successive-approximation switched-capacitor DC-DC converter with 31 mV output voltage resolution," in 2013 IEEE International Solid-State Circuits Conference Digest of Technical Papers, pp. 370-371, IEEE, feb 2013.
[33] M. D. Seeman, A Design Methodology for Switched-Capacitor DC-DC Converters. Ph.D. thesis, University of California, Berkeley, 2009.
[34] L. G. Salem and P. P. Mercier, "A Recursive Switched-Capacitor DC-DC Converter Achieving $2^{N}-1$ Ratios With High Efficiency Over a Wide Output Voltage Range," IEEE Journal of Solid-State Circuits, vol. 49, pp. 2773-2787, dec 2014.
[35] Y. K. Ramadass, A. A. Fayed, and A. P. Chandrakasan, "A Fully-Integrated Switched-Capacitor Step-Down DC-DC Converter With Digital Capacitance Modulation in 45 nm CMOS," IEEE Journal of Solid-State Circuits, vol. 45, pp. 2557-2565, dec 2010.
[36] T. M. Andersen, F. Krismer, J. W. Kolar, T. Toifl, C. Menolfi, L. Kull, T. Morf, M. Kossel, M. Brandli, and P. A. Francese, "20.3 A feedforward controlled on-chip switched-capacitor voltage regulator delivering 10W in 32nm SOI CMOS," in 2015 IEEE International Solid-State Circuits Conference - (ISSCC) Digest of Technical Papers, pp. 1-3, IEEE, feb 2015.
[37] S. Rajapandian, K. Shepard, P. Hazucha, and T. Karnik, "High-Voltage Power Delivery Through Charge Recycling," IEEE Journal of Solid-State Circuits, vol. 41, pp. 1400-1410, jun 2006.
[38] S. K. Lee, T. Tong, X. Zhang, D. Brooks, and G.-Y. Wei, "A 16-core voltage-stacked system with an integrated switched-capacitor DC-DC converter," in 2015 Symposium on VLSI Circuits (VLSI Circuits), pp. C318-C319, IEEE, jun 2015.
[39] C. Schaef and J. T. Stauth, "Efficient Voltage Regulation for Microprocessor Cores Stacked in Vertical Voltage Domains," IEEE Transactions on Power Electronics, vol. 31, pp. 1795-1808, feb 2016.
[40] T. Sakurai and A. Newton, "Alpha-power law MOSFET model and its applications to CMOS inverter delay and other formulas," IEEE Journal of Solid-State Circuits, vol. 25, pp. 584-594, apr 1990.
[41] R. Jevtic, M. Blagojevic, S. Bailey, K. Asanovic, E. Alon, and B. Nikolic, "Per-Core DVFS With Switched-Capacitor Converters for Energy Efficiency in Manycore Processors," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 23, pp. 723-730, apr 2015.
[42] B. Calhoun, A. Wang, and A. Chandrakasan, "Modeling and sizing for minimum energy operation in subthreshold circuits," IEEE Journal of Solid-State Circuits, vol. 40, pp. 1778-1786, sep 2005.
[43] B. Calhoun and A. Chandrakasan, "Ultra-Dynamic Voltage Scaling (UDVS) Using Sub-Threshold Operation and Local Voltage Dithering," IEEE Journal of Solid-State Circuits, vol. 41, pp. 238-245, jan 2006.
[44] W. Dally and J. Poulton, Digital Systems Engineering. Cambridge, UK: Cambridge University Press, 1998.
[45] N. Weste and D. Harris, CMOS VLSI Design. Addison-Wesley, 2011.
[46] "International Technology Roadmap for Semiconductors," 2011.
[47] L. G. Salem and P. P. Mercier, "A footprint-constrained efficiency roadmap for on-chip switched-capacitor DC-DC converters," in 2015 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 2321-2324, IEEE, may 2015.
[48] "LTC1044 - Switched Capacitor Voltage Converter," Linear Technology, 2012.
[49] L. G. Salem, J. F. Buckwalter, and P. P. Mercier, "A recursive house-of-cards digital power amplifier employing a $\lambda / 4$-less Doherty power combiner in 65 nm CMOS," in Proceedings of the IEEE 2016 European Solid-State Circuits Conference, Sept 2016.
[50] L. Chua, C. Desoer, and E. Kuh, Linear and Nonlinear Circuits. New York: McGraw Hill, 1987.
[51] L. G. Salem and P. P. Mercier, "A battery-connected 24-ratio switched capacitor PMIC achieving 95.5%-efficiency," in 2015 Symposium on VLSI Circuits (VLSI Circuits), pp. C340-C341, IEEE, jun 2015.
[52] L. G. Salem and P. P. Mercier, "A single-inductor 7+7 ratio reconfigurable resonant switched-capacitor DC-DC converter with 0.1-to-1.5 V output voltage range," in 2015 IEEE Custom Integrated Circuits Conference (CICC), pp. 1-4, IEEE, sep 2015.
[53] L. G. Salem, J. G. Louie, and P. P. Mercier, "12.9 A flying-domain DC-DC converter powering a Cortex-M0 processor with 90.8% efficiency," in 2016 IEEE International Solid-State Circuits Conference (ISSCC), pp. 234-236, IEEE, jan 2016.
[54] S. Bang, J.-s. Seo, I. Lee, S. Jeong, N. Pinckney, D. Blaauw, D. Sylvester, and L. Chang, "A fully-integrated 40-phase flying-capacitance-dithered switched-capacitor voltage regulator with 6 mV output ripple," in 2015 Symposium on VLSI Circuits (VLSI Circuits), pp. C336-C337, IEEE, jun 2015.
[55] B. Wicht, T. Nirschl, and D. Schmitt-Landsiedel, "Yield and speed optimization of a latch-type voltage sense amplifier," IEEE Journal of Solid-State Circuits, vol. 39, pp. 1148-1158, July 2004.
[56] L. G. Salem and P. P. Mercier, "A 45-ratio recursively sliced series-parallel switched-capacitor DC-DC converter achieving 86% efficiency," in Proceedings of the IEEE 2014 Custom Integrated Circuits Conference, pp. 1-4, Sept 2014.
[57] C. Schaef and J. T. Stauth, "A 3-Phase Resonant Switched Capacitor Converter Delivering 7.7 W at 85% Efficiency Using 1.1 nH PCB Trace Inductors," IEEE Journal of Solid-State Circuits, vol. 50, pp. 2861-2869, dec 2015.
[58] "LPC11Axx 32-bit ARM Cortex-M0 microcontroller; up to 32 kB flash, 8 kB SRAM, 4 kB EEPROM; configurable analog/mixed-signal," NXP Semiconductors, October 2012.
[59] "MCB1000 - Evaluation Board and Starter Kit," ARM Keil, 2016.
[60] J. Myers, A. Savanth, D. Howard, R. Gaddh, P. Prabhat, and D. Flynn, "8.1 An 80nW retention 11.7 pJ/cycle active subthreshold ARM Cortex-M0+ subsystem in 65 nm CMOS for WSN applications," in 2015 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, pp. 1-3, IEEE, feb 2015.
[61] P. Haldi, D. Chowdhury, P. Reynaert, G. Liu, and A. M. Niknejad, "A 5.8 GHz 1 V linear power amplifier using a novel on-chip transformer power combiner in standard 90 nm CMOS," IEEE Journal of Solid-State Circuits, vol. 43, no. 5, pp. 1054-1062, 2008.
[62] T. Sowlati and D. Leenaerts, "A 2.4-GHz 0.18-$\mu$m CMOS self-biased cascode power amplifier," IEEE Journal of Solid-State Circuits, vol. 38, pp. 1318-1324, aug 2003.
[63] J. McRory, G. Rabjohn, and R. Johnston, "Transformer coupled stacked FET power amplifiers," IEEE Journal of Solid-State Circuits, vol. 34, no. 2, pp. 157-161, 1999.
[64] S. Pornpromlikit, J. Jeong, C. Presti, A. Scuderi, and P. Asbeck, "A Watt-Level Stacked-FET Linear Power Amplifier in Silicon-on-Insulator CMOS," IEEE Transactions on Microwave Theory and Techniques, vol. 58, pp. 57-64, jan 2010.
[65] L. Kahn, "Single-Sideband Transmission by Envelope Elimination and Restoration," Proceedings of the IRE, vol. 40, pp. 803-806, jul 1952.
[66] P. Reynaert and M. Steyaert, "A 1.75-GHz polar modulated CMOS RF power amplifier for GSM-EDGE," IEEE Journal of Solid-State Circuits, vol. 40, pp. 2598-2608, dec 2005.
[67] J. S. Walling, S. S. Taylor, and D. J. Allstot, "A Class-G Supply Modulator and Class-E PA in 130 nm CMOS," IEEE Journal of Solid-State Circuits, vol. 44, pp. 2339-2347, sep 2009.
[68] P. A. Godoy, S. Chung, T. W. Barton, D. J. Perreault, and J. L. Dawson, "A 2.4-GHz, 27-dBm Asymmetric Multilevel Outphasing Power Amplifier in 65-nm CMOS," IEEE Journal of Solid-State Circuits, vol. 47, pp. 2372-2384, oct 2012.
[69] J. S. Walling, H. Lakdawala, Y. Palaskas, A. Ravi, O. Degani, K. Soumyanath, and D. J. Allstot, "A Class-E PA With Pulse-Width and Pulse-Position Modulation in 65 nm CMOS," IEEE Journal of Solid-State Circuits, vol. 44, pp. 1668-1678, jun 2009.
[70] A. Kavousian, D. K. Su, M. Hekmat, A. Shirvani, and B. A. Wooley, "A Digitally Modulated Polar CMOS Power Amplifier With a 20-MHz Channel Bandwidth," IEEE Journal of Solid-State Circuits, vol. 43, pp. 2251-2258, oct 2008.
[71] I. Aoki, S. Kee, R. Magoon, R. Aparicio, F. Bohn, J. Zachan, G. Hatcher, D. McClymont, and A. Hajimiri, "A Fully-Integrated Quad-Band GSM/GPRS CMOS Power Amplifier," IEEE Journal of Solid-State Circuits, vol. 43, pp. 2747-2758, dec 2008.
[72] K. H. An, O. Lee, H. Kim, D. H. Lee, J. Han, K. S. Yang, Y. Kim, J. J. Chang, W. Woo, C. H. Lee, H. Kim, and J. Laskar, "Power-combining transformer techniques for fully-integrated CMOS power amplifiers," IEEE Journal of Solid-State Circuits, vol. 43, no. 5, pp. 1064-1074, 2008.
[73] S.-M. Yoo, J. S. Walling, E. C. Woo, B. Jann, and D. J. Allstot, "A Switched-Capacitor RF Power Amplifier," IEEE Journal of Solid-State Circuits, vol. 46, pp. 2977-2987, dec 2011.
[74] S.-M. Yoo, J. S. Walling, O. Degani, B. Jann, R. Sadhwani, J. C. Rudell, and D. J. Allstot, "A Class-G Switched-Capacitor RF Power Amplifier," IEEE Journal of Solid-State Circuits, vol. 48, pp. 1212-1224, may 2013.
[75] L. G. Salem, J. F. Buckwalter, and P. P. Mercier, "A recursive house-of-cards digital power amplifier employing a $\lambda / 4$-less Doherty power combiner in 65 nm CMOS," in ESSCIRC Conference 2016: 42nd European Solid-State Circuits Conference, pp. 189-192, IEEE, sep 2016.
[76] D. Su and W. McFarland, "An IC for linearizing RF power amplifiers using envelope elimination and restoration," IEEE Journal of Solid-State Circuits, vol. 33, no. 12, pp. 2252-2258, 1998.
[77] F. Raab, B. Sigmon, R. Myers, and R. Jackson, "L-band transmitter using Kahn EER technique," IEEE Transactions on Microwave Theory and Techniques, vol. 46, no. 12, pp. 2220-2225, 1998.
[78] F. Wang, D. F. Kimball, J. D. Popp, A. H. Yang, D. Y. Lie, P. M. Asbeck, and L. E. Larson, "An Improved Power-Added Efficiency 19-dBm Hybrid Envelope Elimination and Restoration Power Amplifier for 802.11g WLAN Applications," IEEE Transactions on Microwave Theory and Techniques, vol. 54, pp. 4086-4099, dec 2006.
[79] M. Hassan, L. E. Larson, V. W. Leung, and P. M. Asbeck, "A combined series-parallel hybrid envelope amplifier for envelope tracking mobile terminal RF power amplifier applications," IEEE Journal of Solid-State Circuits, vol. 47, no. 5, pp. 1185-1198, 2012.
[80] L. G. Salem and P. P. Mercier, "A recursive switched-capacitor DC-DC converter achieving $2^{N}-1$ ratios with high efficiency over a wide output voltage range," IEEE Journal of Solid-State Circuits, vol. 49, pp. 2773-2787, dec 2014.
[81] L. G. Salem, J. Warchall, and P. P. Mercier, "A 100nA-to-2mA successive-approximation digital LDO with PD compensation and sub-LSB duty control achieving a 15.1 ns response time at 0.5 V," in 2017 IEEE International Solid-State Circuits Conference (ISSCC), pp. 340-341, IEEE, feb 2017.
[82] A. Afsahi, A. Behzad, and L. E. Larson, "A 65 nm CMOS 2.4 GHz 31.5 dBm power amplifier with a distributed LC power-combining network and improved linearization for WLAN applications," in 2010 IEEE International Solid-State Circuits Conference (ISSCC), pp. 452-453, IEEE, feb 2010.
[83] A. Afsahi and L. E. Larson, "Monolithic power-combining techniques for watt-level 2.4-GHz CMOS power amplifiers for WLAN applications," IEEE Transactions on Microwave Theory and Techniques, vol. 61, pp. 1247-1260, March 2013.
[84] A. Niknejad and R. Meyer, "Analysis, design, and optimization of spiral inductors and transformers for Si RF ICs," IEEE Journal of Solid-State Circuits, vol. 33, no. 10, pp. 1470-1481, 1998.
[85] J. Long and M. Copeland, "The modeling, characterization, and design of monolithic inductors for silicon RF IC's," IEEE Journal of Solid-State Circuits, vol. 32, pp. 357-369, mar 1997.
[86] I. Aoki, S. Kee, D. Rutledge, and A. Hajimiri, "Fully integrated CMOS power amplifier design using the distributed active-transformer architecture," IEEE Journal of Solid-State Circuits, vol. 37, pp. 371-383, mar 2002.
[87] A. K. Ezzeddine, H. C. Huang, and J. L. Singer, "UHiFET - a new high-frequency high-voltage device," in 2011 IEEE MTT-S International Microwave Symposium, pp. 1-4, June 2011.
[88] S. Leuschner, J.-E. Mueller, and H. Klar, "A 1.8 GHz wide-band stacked-cascode CMOS power amplifier for WCDMA applications in 65 nm standard CMOS," in 2011 IEEE Radio Frequency Integrated Circuits Symposium, pp. 1-4, IEEE, jun 2011.
[89] J. Long, "Monolithic transformers for silicon RF IC design," IEEE Journal of Solid-State Circuits, vol. 35, pp. 1368-1382, sep 2000.
[90] L. Wu, I. Dettmann, and M. Berroth, "A 900-MHz 29.5-dBm 0.13-$\mu$m CMOS HiVP Power Amplifier," IEEE Transactions on Microwave Theory and Techniques, vol. 56, pp. 2040-2045, sep 2008.
[91] Y. Lu, Y. Wang, Q. Pan, W. H. Ki, and C. P. Yue, "A Fully-Integrated Low-Dropout Regulator With Full-Spectrum Power Supply Rejection," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 62, pp. 707-716, March 2015.
[92] L. G. Salem, J. G. Louie, and P. P. Mercier, "Flying-Domain DC-DC Power Conversion," IEEE Journal of Solid-State Circuits, vol. 51, pp. 2830-2842, dec 2016.
[93] R. Staszewski, R. B. Staszewski, T. Jung, T. Murphy, I. Bashir, O. Eliezer, K. Muhammad, and M. Entezari, "Software Assisted Digital RF Processor (DRP) for Single-Chip GSM Radio in 90 nm CMOS," IEEE Journal of Solid-State Circuits, vol. 45, pp. 276-288, feb 2010.
[94] S. Hu, S. Kousai, J. S. Park, O. L. Chlieh, and H. Wang, "Design of A Transformer-Based Reconfigurable Digital Polar Doherty Power Amplifier Fully Integrated in Bulk CMOS," IEEE Journal of Solid-State Circuits, vol. 50, pp. 1094-1106, may 2015.
[95] E. Kaymaksut and P. Reynaert, "Dual-Mode CMOS Doherty LTE Power Amplifier With Symmetric Hybrid Transformer," IEEE Journal of Solid-State Circuits, vol. 50, pp. 1974-1987, sep 2015.
[96] V. Vorapipat, C. Levy, and P. Asbeck, "A wideband voltage mode Doherty power amplifier," in 2016 IEEE Radio Frequency Integrated Circuits Symposium (RFIC), pp. 266-269, IEEE, may 2016.
[97] S. C. Chan, P. J. Restle, K. L. Shepard, N. K. James, and R. L. Franch, "A 4.6GHz resonant global clock distribution network," in 2004 IEEE International Solid-State Circuits Conference (IEEE Cat. No.04CH37519), pp. 342-343 Vol.1, Feb 2004.
[98] P. Restle, D. Shan, D. Hogenmiller, Y. Kim, A. Drake, J. Hibbeler, T. Bucelot, G. Still, K. Jenkins, and J. Friedrich, "Wide-frequency-range resonant clock with on-the-fly mode changing for the POWER8™ microprocessor," in 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 100-101, Feb 2014.
[99] H. Fuketa, M. Nomura, M. Takamiya, and T. Sakurai, "Intermittent resonant clocking enabling power reduction at any clock frequency for 0.37V 980kHz near-threshold logic circuits," in 2013 IEEE International Solid-State Circuits Conference Digest of Technical Papers, pp. 436-437, Feb 2013.
[100] F. U. Rahman and V. S. Sathe, "19.6 Voltage-scalable frequency-independent quasi-resonant clocking implementation of a 0.7-to-1.2V DVFS system," in 2016 IEEE International Solid-State Circuits Conference (ISSCC), pp. 334-335, Jan 2016.
[101] A. Wang and A. Chandrakasan, "A 180-mV subthreshold FFT processor using a minimum energy design methodology," IEEE Journal of Solid-State Circuits, vol. 40, pp. 310-319, Jan 2005.
[102] H. Kaul, M. Anders, S. Mathew, S. Hsu, A. Agarwal, R. Krishnamurthy, and S. Borkar, "A 320 mV 56 $\mu$W 411 GOPS/Watt Ultra-Low Voltage Motion Estimation Accelerator in 65 nm CMOS," in 2008 IEEE International Solid-State Circuits Conference - Digest of Technical Papers, pp. 316-616, Feb 2008.
[103] J. Kwong, Y. K. Ramadass, N. Verma, and A. P. Chandrakasan, "A 65 nm Sub-$V_{t}$ Microcontroller With Integrated SRAM and Switched Capacitor DC-DC Converter," IEEE Journal of Solid-State Circuits, vol. 44, pp. 115-126, Jan 2009.
[104] A. Agarwal, S. K. Mathew, S. K. Hsu, M. A. Anders, H. Kaul, F. Sheikh, R. Ramanarayanan, S. Srinivasan, R. Krishnamurthy, and S. Borkar, "A 320mV-to-1.2V on-die fine-grained reconfigurable fabric for DSP/media accelerators in 32nm CMOS," in 2010 IEEE International Solid-State Circuits Conference - (ISSCC), pp. 328-329, Feb 2010.
[105] N. Lotze and Y. Manoli, "A 62 mV 0.13 $\mu$m CMOS Standard-Cell-Based Design Technique Using Schmitt-Trigger Logic," IEEE Journal of Solid-State Circuits, vol. 47, pp. 47-60, Jan 2012.
[106] M. Fojtik, D. Kim, G. Chen, Y. S. Lin, D. Fick, J. Park, M. Seok, M. T. Chen, Z. Foo, D. Blaauw, and D. Sylvester, "A Millimeter-Scale Energy-Autonomous Sensor System With Stacked Battery and Solar Cells," IEEE Journal of Solid-State Circuits, vol. 48, pp. 801-813, March 2013.
[107] W. Lim, I. Lee, D. Sylvester, and D. Blaauw, "8.2 Batteryless Sub-nW Cortex-M0+ processor with dynamic leakage-suppression logic," in 2015 IEEE International Solid-State Circuits Conference - (ISSCC) Digest of Technical Papers, pp. 1-3, Feb 2015.
[108] G. A. Rincon-Mora and P. E. Allen, "A low-voltage, low quiescent current, low drop-out regulator," IEEE Journal of Solid-State Circuits, vol. 33, pp. 36-44, Jan 1998.
[109] K. N. Leung and P. K. T. Mok, "A capacitor-free CMOS low-dropout regulator with damping-factor-control frequency compensation," IEEE Journal of Solid-State Circuits, vol. 38, pp. 1691-1702, Oct 2003.
[110] M. Al-Shyoukh, H. Lee, and R. Perez, "A Transient-Enhanced Low-Quiescent Current Low-Dropout Regulator With Buffer Impedance Attenuation," IEEE Journal of Solid-State Circuits, vol. 42, pp. 1732-1742, Aug 2007.
[111] M. Ho, K. N. Leung, and K. L. Mak, "A Low-Power Fast-Transient 90-nm Low-Dropout Regulator With Multiple Small-Gain Stages," IEEE Journal of Solid-State Circuits, vol. 45, pp. 2466-2475, Nov 2010.
[112] M. El-Nozahi, A. Amer, J. Torres, K. Entesari, and E. Sanchez-Sinencio, "High PSR Low Drop-Out Regulator With Feed-Forward Ripple Cancellation Technique," IEEE Journal of Solid-State Circuits, vol. 45, pp. 565-577, March 2010.
[113] J. Guo and K. N. Leung, "A 6-$\mu$W Chip-Area-Efficient Output-Capacitorless LDO in 90-nm CMOS Technology," IEEE Journal of Solid-State Circuits, vol. 45, pp. 1896-1905, Sept 2010.
[114] J. F. Bulzacchelli, Z. Toprak-Deniz, T. M. Rasmus, J. A. Iadanza, W. L. Bucossi, S. Kim, R. Blanco, C. E. Cox, M. Chhabra, C. D. LeBlanc, C. L. Trudeau, and D. J. Friedman, "Dual-Loop System of Distributed Microregulators With High DC Accuracy, Load Response Time Below 500 ps, and 85-mV Dropout Voltage," IEEE Journal of Solid-State Circuits, vol. 47, pp. 863-874, April 2012.
[115] P. Hazucha, S. T. Moon, G. Schrom, F. Paillet, D. Gardner, S. Rajapandian, and T. Karnik, "High Voltage Tolerant Linear Regulator With Fast Digital Control for Biasing of Integrated DC-DC Converters," IEEE Journal of Solid-State Circuits, vol. 42, pp. 66-73, Jan 2007.
[116] Y. Okuma, K. Ishida, Y. Ryu, X. Zhang, P.-H. Chen, K. Watanabe, M. Takamiya, and T. Sakurai, "0.5-V input digital LDO with 98.7% current efficiency and 2.7-$\mu$A quiescent current in 65 nm CMOS," in IEEE Custom Integrated Circuits Conference 2010, pp. 1-4, Sept 2010.
[117] Y. H. Lee, S. Y. Peng, C. C. Chiu, A. C. H. Wu, K. H. Chen, Y. H. Lin, S. W. Wang, T. Y. Tsai, C. C. Huang, and C. C. Lee, "A Low Quiescent Current Asynchronous Digital-LDO With PLL-Modulated Fast-DVS Power Management in 40 nm SoC for MIPS Performance Improvement," IEEE Journal of Solid-State Circuits, vol. 48, pp. 1018-1030, April 2013.
[118] S. Gangopadhyay, D. Somasekhar, J. W. Tschanz, and A. Raychowdhury, "A 32 nm Embedded, Fully-Digital, Phase-Locked Low Dropout Regulator for Fine Grained Power Management in Digital Circuits," IEEE Journal of Solid-State Circuits, vol. 49, pp. 2684-2693, Nov 2014.
[119] S. Gangopadhyay, Y. Lee, S. B. Nasir, and A. Raychowdhury, "Modeling and analysis of digital linear dropout regulators with adaptive control for high efficiency under wide dynamic range digital loads," in 2014 Design, Automation Test in Europe Conference Exhibition (DATE), pp. 1-6, March 2014.
[120] D. Kim and M. Seok, "8.2 Fully integrated low-drop-out regulator based on event-driven PI control," in 2016 IEEE International Solid-State Circuits Conference (ISSCC), pp. 148-149, Jan 2016.
[121] Y. J. Lee, W. Qu, S. Singh, D. Y. Kim, K. H. Kim, S. H. Kim, J. J. Park, and G. H. Cho, "A 200-mA Digital Low Drop-Out Regulator With Coarse-Fine Dual Loop in Mobile Application Processor," IEEE Journal of Solid-State Circuits, vol. 52, pp. 64-76, Jan 2017.
[122] W. J. Tsou, W. H. Yang, J. H. Lin, H. Chen, K. H. Chen, C. L. Wey, Y. H. Lin, S. R. Lin, and T. Y. Tsai, "20.2 Digital low-dropout regulator with anti PVT-variation technique for dynamic voltage scaling and adaptive voltage scaling multicore processor," in 2017 IEEE International Solid-State Circuits Conference (ISSCC), pp. 338-339, Feb 2017.
[123] M. Huang, Y. Lu, S. P. U, and R. P. Martins, "20.4 An output-capacitor-free analog-assisted digital low-dropout regulator with tri-loop control," in 2017 IEEE International Solid-State Circuits Conference (ISSCC), pp. 342-343, Feb 2017.
[124] D. Kim, J. Kim, H. Ham, and M. Seok, "20.6 A 0.5V-VIN 1.44mA-class event-driven digital LDO with a fully integrated 100 pF output capacitor," in 2017 IEEE International Solid-State Circuits Conference (ISSCC), pp. 346-347, Feb 2017.
[125] S. Nasir, S. Gangopadhyay, and A. Raychowdhury, "All-Digital Low-Dropout Regulator with Adaptive Control and Reduced Dynamic Stability for Digital Load Circuits," IEEE Transactions on Power Electronics, vol. 31, no. 12, pp. 1-1, 2016.
[126] F. Yang, Y. Lu, and P. K. T. Mok, "A comparative analysis on binary and multiple-unary weighted power stage design for digital LDO," in 2016 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS), pp. 41-42, Oct 2016.
[127] H. Song, W. Rhee, I. Shim, and Z. Wang, "Digital LDO with 1-bit $\Delta \Sigma$ modulation for low-voltage clock generation systems," Electronics Letters, vol. 52, no. 25, pp. 2034-2036, 2016.
[128] Y. Li, X. Zhang, Z. Zhang, and Y. Lian, "A 0.45-to-1.2-V Fully Digital Low-Dropout Voltage Regulator With Fast-Transient Controller for Near/Subthreshold Circuits," IEEE Transactions on Power Electronics, vol. 31, pp. 6341-6350, Sept 2016.
[129] A. V. Oppenheim, A. S. Willsky, and S. H. Nawab, Signals and Systems (2nd Ed.). Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1996.
[130] S. B. Nasir and A. Raychowdhury, "On limit cycle oscillations in discrete-time digital linear regulators," in 2015 IEEE Applied Power Electronics Conference and Exposition (APEC), pp. 371-376, March 2015.
[131] Z. Toprak-Deniz, M. Sperling, J. Bulzacchelli, G. Still, R. Kruse, S. Kim, D. Boerstler, T. Gloekler, R. Robertazzi, K. Stawiasz, T. Diemoz, G. English, D. Hui, P. Muench, and J. Friedrich, "5.2 Distributed system of digitally controlled microregulators enabling per-core DVFS for the POWER8™ microprocessor," in 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 98-99, Feb 2014.
[132] "TPS659037 - Power Management IC (PMIC) for ARM Cortex A15 Processors," Texas Instruments.
[133] S. B. Nasir, S. Gangopadhyay, and A. Raychowdhury, "5.6 A 0.13 $\mu$m fully digital low-dropout regulator with adaptive control and reduced dynamic stability for ultra-wide dynamic range," in 2015 IEEE International Solid-State Circuits Conference - (ISSCC) Digest of Technical Papers, pp. 1-3, Feb 2015.
[134] L. G. Salem, J. Warchall, and P. P. Mercier, "20.3 A 100nA-to-2mA successive-approximation digital LDO with PD compensation and sub-LSB duty control achieving a 15.1 ns response time at 0.5 V," in 2017 IEEE International Solid-State Circuits Conference (ISSCC), pp. 340-341, Feb 2017.


[^0]:    ${ }^{1}$ A simple addition of the two loss limits, $R_{SSL}$ and $R_{FSL}$, is used to express the intrinsic RSC loss, which overestimates the total $R_{\text{out}}$. Negligible bottom-plate parasitics are assumed, and the equivalent load resistance $R_{L}$ is assumed to be larger than $R_{\text{out}}$, to obtain simple, intuitive expressions. The formulas for the optimal switch width and frequency of a 2:1 SC converter in [13] are used in the derivation.

[^1]:    ${ }^{1}$ Power switches' drain parasitics are included as part of $P_{\text {Bot-cap }}$.

[^2]:    ${ }^{1}$ A startup circuit is required to ensure that the employed devices are not overstressed during power on. This is typically applied in high-voltage DC-DC converters via inrush current limiting circuitry.

[^3]:    ${ }^{2}$ LP was chosen due to run availability; better performance could have been achieved in a general purpose (GP) process.

[^4]:    ${ }^{1}$ Otherwise, when the rise/fall time of the load-current swing is comparable to $T_{R}$, the measured $T_{R}$ becomes the unit-ramp response time and not the unit-step response time. This invalidates the measured FOM since the output voltage droop, $\Delta V_{\text {droop }}$, due to a unit-ramp current is much smaller than the droop in the unit-step current case.

[^5]:    ${ }^{2}$ The $INC$ pulse is produced by passing $CLKL$ through first-edge-pass logic during PWM control.
    ${ }^{3}$ The proposed scheme is similar to branch-prediction solutions that avoid CPU pipeline hazards in computer architecture.

