# More-than-Moore with Integrated Silicon-Photonics

Vladimir Stojanović, Berkeley Wireless Research Center, UC Berkeley



- Milos Popović (Boulder/BU), Rajeev Ram, Jason Orcutt, Hanqing Li (MIT), Krste Asanović (UC Berkeley)
- Jeffrey Shainline, Christopher Batten, Ajay Joshi, Anatoly Khilo
- Mark Wade, Karan Mehta, Jie Sun, Josh Wang
- Chen Sun, Sen Lin, Sajjad Moazeni, Nandish Mehta, Michael Georgas, Benjamin Moss, Jonathan Leu
- Yong-Jin Kwon, Scott Beamer, Yunsup Lee, Andrew Waterman, Miquel Planas, Rimas Avizienis, Henry Cook, Huy Vo
- Roy Meade, Gurtej Sandhu and Micron Fab12 team (Zvi, Ofer, Daniel, Efi, Elad, ...)
- DARPA, Micron, NSF, BWRC
- IBM Trusted Foundry, Global Foundries

#### More-than-Moore perspective

# Enhanced CMOS enables new applications

- 2012: World's first siPhotonic transmitter in 45nm SOI (Stojanović, Popović, Ram)
- 2004: World's first 60GHz CMOS amplifier (Niknejad & Brodersen)
- 1997: One of the first CMOS radios (Rudell & Gray)
- 1990: Inductors in IC process (Nguyen & Meyer)

# IBM/GF 12SOI (45nm) CMOS



- 300mm wafer, commercial process
- MOSIS and TAPO MPW access
- Advanced process used in microprocessors
- Photonic enhancement enables VLSI photonic systems (no required process changes)

Pictured: IBM Cell, IBM Espresso, IBM Power 7

# "Zero-Change" Photonics in 45nm



- Photonics for free! (No modification to the process)
- Closest proximity of electronics and photonics

- Single substrate removal post-processing step

Monolithic photonics platform with fastest transistors

Integrated photonic interconnects




# Single channel link tradeoffs



# Need to optimize carefully



Moderate data rates are the most energy-efficient [Georgas CICC 2011]

# DWDM link efficiency optimization



- Optimize for min energy-cost
- Bandwidth density dominated by circuit and photonics area (not coupler pitch)

# Towards an Optical DRAM System



70M transistors, 1000 optical devices (DARPA POEM)

#### World's First Processor to Communicate with Light

Silicon-Photonic components integrated directly in the chip DARPA POEM & PERFECT – Stojanović, Asanović



#### Processor Cores – 45nm SOI

Energy-efficiency shmoo (GFlops/W) versus supply voltage Vdd (V, columns) and clock frequency (MHz, rows) [Lee ESSCIRC 2014]. Empty cells at low Vdd and high frequency fall in the region labeled "Not Operational".

| Frequency (MHz) | 0.65 | 0.70 | 0.75 | 0.80 | 0.85 | 0.90 | 0.95 | 1.00 | 1.05 | 1.10 | 1.15 | 1.20 |
|-----------------|------|------|------|------|------|------|------|------|------|------|------|------|
| 200             | 15.8 | 12.8 | 10.6 | 8.7  | 7.1  | 5.7  | 4.6  | 3.7  | 3.1  | 2.6  | 2.1  | 1.6  |
| 250             | 16.7 | 14.0 | 11.6 | 9.6  | 7.9  | 6.4  | 5.3  | 4.3  | 3.6  | 3.0  | 2.4  | 1.9  |
| 300             |      | 14.9 | 12.4 | 10.3 | 8.6  | 7.0  | 5.8  | 4.8  | 4.0  | 3.3  | 2.8  | 2.2  |
| 350             |      | 15.6 | 13.0 | 10.9 | 9.1  | 7.5  | 6.2  | 5.2  | 4.3  | 3.7  | 3.0  | 2.4  |
| 400             |      |      | 13.6 | 11.3 | 9.6  | 7.9  | 6.6  | 5.5  | 4.7  | 3.9  | 3.3  | 2.6  |
| 450             |      |      | 14.1 | 11.8 | 9.9  | 8.3  | 6.9  | 5.8  | 4.9  | 4.2  | 3.5  | 2.8  |
| 500             |      |      |      | 12.2 | 10.3 | 8.6  | 7.2  | 6.1  | 5.2  | 4.4  | 3.7  | 3.0  |
| 550             |      |      |      | 12.5 | 10.5 | 8.8  | 7.5  | 6.3  | 5.4  | 4.5  | 3.8  | 3.1  |
| 600             |      |      |      |      | 10.8 | 9.1  | 7.7  | 6.5  | 5.6  | 4.7  | 4.0  | 3.3  |
| 650             |      |      |      |      | 11.0 | 9.3  | 7.9  | 6.7  | 5.8  | 4.9  | 4.2  | 3.3  |
| 700             |      |      |      |      | 11.2 | 9.5  | 8.1  | 6.9  | 5.9  | 5.0  | 4.3  | 3.5  |
| 750             |      |      |      |      | 11.4 | 9.7  | 8.3  | 7.1  | 6.1  | 5.2  | 4.4  | 3.6  |
| 800             |      |      |      |      |      | 9.8  | 8.4  | 7.2  | 6.2  | 5.3  | 4.5  | 3.6  |
| 850             |      |      |      |      |      |      | 8.6  | 7.4  | 6.4  | 5.4  | 4.6  | 3.7  |
| 900             |      |      |      |      |      |      | 8.7  | 7.5  | 6.5  | 5.5  | 4.7  | 3.8  |
| 950             |      |      |      |      |      |      | 8.8  | 7.6  | 6.6  | 5.6  | 4.8  | 3.9  |
| 1000            |      |      |      |      |      |      |      | 7.3  | 6.6  | 5.7  | 4.8  | 4.0  |
| 1050            |      |      |      |      |      |      |      |      | 6.7  | 5.7  | 4.9  | 4.0  |
| …               |      |      |      |      |      |      |      |      |      |      |      |      |
| 1250            |      |      |      |      |      |      |      |      |      |      | 5.1  | 4.3  |
| 1300            |      |      |      |      |      |      |      |      |      |      |      | 4.2  |
| 1350            |      |      |      |      |      |      |      |      |      |      |      |      |

- RISC-V open ISA
- Scalar-vector cores that boot Linux
- 0.2-1.35 GHz, 4-16 GFlops/W

### Si Waveguides






#### **Vertical couplers**

# Waveguide Diffraction Grating



# Waveguide Taper

[Wade OIC 2015]



[Orcutt 2013, Alloatti APL 2015]

#### **Key Device Components**




Figure labels: Integrated Heater, Output Waveguide

[Shainline OL 2013, Wade OFC 2014]


#### Transmitter



#### Receiver



- Low parasitics from monolithic integration enable a single-stage $5\,\mathrm{k}\Omega$ TIA receiver
- 10 Gb/s operation at 290 fJ/bit with 8.3 µA sensitivity
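As a quick sanity check on the receiver figures above, the sketch below converts them into an output swing and a power draw. It assumes the 8.3 µA sensitivity is the input current seen by the TIA and treats the 5 kΩ as an ideal transimpedance gain. The function names are illustrative and do not come from the slides.

```python
def tia_output_swing_mv(input_current_ua: float, transimpedance_kohm: float) -> float:
    """Ideal TIA output swing in mV: V = I * R (uA times kOhm gives mV)."""
    return input_current_ua * transimpedance_kohm


def power_mw(energy_fj_per_bit: float, data_rate_gbps: float) -> float:
    """Average power in mW: fJ/bit times Gb/s gives uW, divided by 1000 for mW."""
    return energy_fj_per_bit * data_rate_gbps / 1000.0


# Figures quoted on the receiver slide.
swing_mv = tia_output_swing_mv(8.3, 5.0)   # about 41.5 mV at the sensitivity limit
rx_power_mw = power_mw(290.0, 10.0)        # about 2.9 mW at 10 Gb/s

print(f"TIA swing at sensitivity: {swing_mv:.1f} mV")
print(f"Receiver power: {rx_power_mw:.2f} mW")
```

Under these assumptions the single stage has to resolve a swing of a few tens of millivolts.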

# 5 Gb/s Chip-to-Chip Link




# 5 Gb/s Link Efficiency Summary



"Zero-change" monolithic platform is competitive with state-of-the-art heterogeneous platforms

680 fJ/bit, 14mW optical power [Zheng PTL 2012\*\*]

\*Includes all closed-loop circuits + 0.5 nm tuning power

\*\*0.5 nm tuning power only

# 5 Gb/s Link Efficiency Summary



#### 560 fJ/bit for laser – wall-plug\*\*


- The link does not use our best devices
  - 1 dB loss couplers [Wade OIC 2015] are available on the same chip, versus the 4 dB couplers used in the link
  - 5-10x better photodetector (0.1-0.2 A/W photodetector on the same chip)
- Expect to obtain >40x smaller laser power (65 fJ/b optical)
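The laser numbers on this slide can be cross-checked with a short calculation. The sketch below converts the 560 fJ/bit wall-plug figure to optical energy using the 11.6% laser efficiency, then estimates how much laser power the better couplers and photodetector could save. The number of couplers in the optical path (`N_COUPLERS = 3`) is an assumption made for illustration; the slides do not state it.

```python
def optical_energy_fj(wall_plug_fj: float, wall_plug_efficiency: float) -> float:
    """Optical energy per bit delivered by a laser with the given wall-plug efficiency."""
    return wall_plug_fj * wall_plug_efficiency


def db_to_ratio(db: float) -> float:
    """Convert a power gain or loss in dB to a linear ratio."""
    return 10 ** (db / 10.0)


def laser_power_reduction(coupler_gain_db: float, n_couplers: int, pd_gain: float) -> float:
    """Factor by which required laser power drops when every coupler improves by
    coupler_gain_db and the photodetector responsivity improves by pd_gain."""
    return db_to_ratio(coupler_gain_db * n_couplers) * pd_gain


N_COUPLERS = 3             # ASSUMPTION: couplers traversed between laser and detector
COUPLER_GAIN_DB = 4.0 - 1.0  # 4 dB couplers in the link versus 1 dB couplers on chip

optical_fj = optical_energy_fj(560.0, 0.116)                        # about 65 fJ/b
low = laser_power_reduction(COUPLER_GAIN_DB, N_COUPLERS, 5.0)      # 5x better PD
high = laser_power_reduction(COUPLER_GAIN_DB, N_COUPLERS, 10.0)    # 10x better PD

print(f"Optical energy: {optical_fj:.1f} fJ/b")
print(f"Laser power reduction: {low:.0f}x to {high:.0f}x")
```

With three couplers assumed, the estimate lands at roughly 40x to 80x, in the neighborhood of the >40x figure. A different coupler count would shift the result by 3 dB per coupler.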

\*\*11.6% QD laser wall-plug efficiency

\*Includes all closed-loop circuits + 0.5 nm tuning power


# Electronic-Photonic Packaging

Die-thinned chip with selective substrate removal




- Flip-chip onto FR4 PCB using C4 bumps
- Selective substrate removal of optical transceiver regions

# **Optical Memory System Demo**



#### Tx and Rx DWDM Transceiver Banks



## 11 x 8 Gbps Tx Demonstration

- 11 rings, each demonstrating 8 Gbps modulation
  - Independently testing one at a time
  - Potential for 88 Gb/s on a single fiber/waveguide
  - Each ring is auto-locked





#### Going Faster – PAM2 and PAM4



### Chip floorplan



# Transmitter eye diagrams



- Extinction ratio (ER): 3 dB, insertion loss (IL): 5.5 dB
- PAM4 coding used: (0,5,10,15)
- 42 fJ/b driver energy efficiency
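One way to read the level set (0,5,10,15) is as four equally spaced codes of a 4-bit driver. The sketch below maps bit pairs to those codes. That interpretation, the Gray bit-to-level mapping, and all names in the code are assumptions for illustration, not details taken from the slides.

```python
from typing import List, Sequence

# Driver codes quoted on the slide for the four PAM4 levels.
PAM4_CODES = (0, 5, 10, 15)

# ASSUMPTION: Gray mapping, so adjacent levels differ in exactly one bit.
GRAY_TO_LEVEL = {(0, 0): 0, (0, 1): 1, (1, 1): 2, (1, 0): 3}


def pam4_encode(bits: Sequence[int]) -> List[int]:
    """Map a bit sequence to PAM4 driver codes, two bits per symbol."""
    if len(bits) % 2 != 0:
        raise ValueError("PAM4 needs an even number of bits")
    return [
        PAM4_CODES[GRAY_TO_LEVEL[(bits[i], bits[i + 1])]]
        for i in range(0, len(bits), 2)
    ]


def er_db_to_power_ratio(er_db: float) -> float:
    """Convert an extinction ratio in dB to a linear optical power ratio."""
    return 10 ** (er_db / 10.0)


symbols = pam4_encode([0, 0, 0, 1, 1, 1, 1, 0])  # one symbol per level, lowest to highest
ratio = er_db_to_power_ratio(3.0)                # close to 2: top level has about twice the power

print(symbols, round(ratio, 2))
```

Because each symbol carries two bits, a PAM4 transmitter reaches a given bit rate at half the symbol rate of PAM2.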



# **Improved Rx Topologies**







- Leverage tight electronic-photonic integration to create new, more sensitive receiver structures
  - Differential, DDR receiver

[Nandish Mehta et al. ESSCIRC 2016]

# Platform Performance Summary

| Metric                    | [Beamer ISCA 2010]<br>Conservative Estimates | 45nm SOI<br>Platform | Bulk Photonics<br>Platform* |
|---------------------------|----------------------------------------------|----------------------|-----------------------------|
| Waveguide Loss            | 4 dB/cm                                      | 3.7 dB/cm            | 10.5 dB/cm                  |
| Vertical Coupler Loss     | 1 dB                                         | 1 dB                 | 3 dB                        |
| Tx Data Rate              | 10 Gb/s                                      | 20 Gb/s              | 5 Gb/s                      |
| Tx Energy Per Bit         | 120 fJ/b                                     | 42 fJ/b              | 350 fJ/b                    |
| Rx Data Rate              | 10 Gb/s                                      | 12 Gb/s              | 5 Gb/s                      |
| Rx Energy Per Bit         | 80 fJ/b                                      | 297 fJ/b             | 1700 fJ/b                   |
| Rx Sensitivity            | 10 µA                                        | 8 µA                 | 36 µA                       |
| PD Responsivity           | 0.9 A/W                                      | 0.44 A/W             | 0.2 A/W                     |
| Thermal Tuning Efficiency | 1.6 μW/GHz                                   | 3.8 μW/GHz           | 10 μW/GHz                   |

- Comparison to a proposal for the processor-memory system we published 6 years ago
- Meeting/exceeding most system specs

\*Considerably slower process than the one assumed in [Beamer ISCA 2010]
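The table gives receiver sensitivity as a current and photodetector responsivity separately. Dividing one by the other gives the optical power each receiver needs, which makes the three columns easier to compare. The sketch below does that and also sums Tx and Rx energy per bit. Values are copied from the table; the dictionary layout and function names are illustrative, and the energy sum leaves out laser and thermal tuning power.

```python
# Values copied from the Platform Performance Summary table.
PLATFORMS = {
    "Beamer ISCA 2010 estimate": {"sens_ua": 10.0, "resp_a_per_w": 0.9, "tx_fj": 120.0, "rx_fj": 80.0},
    "45nm SOI": {"sens_ua": 8.0, "resp_a_per_w": 0.44, "tx_fj": 42.0, "rx_fj": 297.0},
    "Bulk photonics": {"sens_ua": 36.0, "resp_a_per_w": 0.2, "tx_fj": 350.0, "rx_fj": 1700.0},
}


def optical_sensitivity_uw(sens_ua: float, resp_a_per_w: float) -> float:
    """Optical power (uW) at the detector needed to produce the sensitivity current."""
    return sens_ua / resp_a_per_w


def transceiver_energy_fj(tx_fj: float, rx_fj: float) -> float:
    """Electrical Tx + Rx energy per bit (laser and tuning power not included)."""
    return tx_fj + rx_fj


summary = {
    name: (
        optical_sensitivity_uw(p["sens_ua"], p["resp_a_per_w"]),
        transceiver_energy_fj(p["tx_fj"], p["rx_fj"]),
    )
    for name, p in PLATFORMS.items()
}

for name, (sens_uw, energy_fj) in summary.items():
    print(f"{name}: {sens_uw:.1f} uW optical sensitivity, {energy_fj:.0f} fJ/b Tx+Rx")
```

On these numbers the 45nm SOI receiver needs about 18 µW at the detector, against about 11 µW in the earlier estimate and about 180 µW for the bulk platform.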

# Poly Si Photonics in Bulk CMOS

#### DRAM processes heavily optimized for cost


Micron wafers





# Memory: Bulk photonics integration

#### **First-ever link result with bulk CMOS photonics**

Figure: two chips from the Micron D1L reticle (180nm bulk chip) linked through single-mode fiber, with a laser at $\lambda_1$, a 90/10 splitter, and a monitor scope. Each chip carries Tx and Rx macros: PRBS generator, 8:2 serializer, and microring modulator with heater tuning on the transmit side; microring detector, 2:8 deserializer, and BER checker on the receive side; vertical couplers for optical I/O. Panels: received eye versus time (ps) and bit-error rate.

[Meade et al. VLSI Tech Symp 14, Sun et al. VLSI Ckts Symp 14]


# WDM in bulk-photonics - Tx



- All slices BER checked at 5Gb/s
- 45Gb/s aggregate rate per waveguide

# WDM in bulk-photonics - Rx



- All receive slices functional and BER checked at 5 Gb/s
- A single fiber provides more I/O bandwidth than a x16 DDR4 part
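The aggregate-bandwidth figures follow from multiplying the number of wavelength slices by the per-slice rate. The sketch below reproduces the 88 Gb/s and 45 Gb/s numbers and sets them against a x16 DDR4 part. The DDR4 speed grade (2400 MT/s per pin) is an assumption chosen for illustration, and the comparison depends on it. The slice count of 9 for the bulk platform is inferred from 45 Gb/s at 5 Gb/s per slice.

```python
def aggregate_gbps(n_channels: int, rate_gbps: float) -> float:
    """Aggregate bandwidth of n parallel channels at the same rate."""
    return n_channels * rate_gbps


# 45nm SOI demo: 11 rings at 8 Gb/s each on one fiber/waveguide.
soi_tx_gbps = aggregate_gbps(11, 8.0)

# Bulk platform: 45 Gb/s aggregate at 5 Gb/s per slice implies 9 slices.
bulk_slices = round(45.0 / 5.0)
bulk_gbps = aggregate_gbps(bulk_slices, 5.0)

# ASSUMPTION: DDR4-2400 speed grade, i.e. 2.4 Gb/s per data pin, 16 data pins.
DDR4_PIN_RATE_GBPS = 2.4
ddr4_x16_gbps = aggregate_gbps(16, DDR4_PIN_RATE_GBPS)

print(f"45nm SOI Tx: {soi_tx_gbps:.0f} Gb/s per fiber")
print(f"Bulk WDM: {bulk_gbps:.0f} Gb/s per fiber ({bulk_slices} slices)")
print(f"x16 DDR4 at assumed grade: {ddr4_x16_gbps:.1f} Gb/s over 16 data pins")
```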

- Silicon photonics is an enabler of new capabilities
  - Think "new on-chip inductor" or "new on-chip t-line"
- It could revolutionize many applications despite the slowdown in CMOS scaling
  - VLSI compute and network infrastructure are just a start
- Progress needs process, device, circuit, and system-level understanding