DISSERTATION

DESIGN AND OPTIMIZATION OF EFFICIENT, FAULT-TOLERANT AND SECURE 2.5D CHIPLET SYSTEMS

Submitted by
Ebad Taheri
Department of Electrical and Computer Engineering

In partial fulfillment of the requirements
For the Degree of Doctor of Philosophy
Colorado State University
Fort Collins, Colorado
Fall 2024

Doctoral Committee:

Advisor: Mahdi Nikdast
Co-Advisor: Sudeep Pasricha

Yashwant K. Malaiya
Anura P. Jayasumana

Copyright by Ebad Taheri 2024
All Rights Reserved

ABSTRACT

DESIGN AND OPTIMIZATION OF EFFICIENT, FAULT-TOLERANT AND SECURE 2.5D CHIPLET SYSTEMS

In response to the burgeoning demand for high-performance computing systems, this Ph.D. dissertation investigates the pivotal challenges surrounding Networks-on-Chip (NoCs) within the framework of 2.5D and 3D integration technologies, with the primary objective of enhancing the efficiency, fault tolerance, and security of forthcoming computing system architectures. The inherent limitations in bandwidth and reliability at the boundaries of chiplets in 2.5D chiplet systems create significant challenges in traffic management, latency, and energy efficiency. Furthermore, the global network on an interposer, which links multiple chiplets, necessitates high-bandwidth, low-latency communication to accommodate the substantial traffic generated by numerous cores across diverse chiplets. This dissertation addresses various design aspects of NoCs, including latency, energy efficiency, fault tolerance, and security. It explores the design of 3D NoCs leveraging Through-Silicon Vias (TSVs) for vertical communication. To address the reliability concerns and fabrication costs associated with high TSV density, a Partially Connected 3D NoC (PC-3DNoC) is proposed, together with an adaptive congestion-aware TSV link selection algorithm that manages traffic load and optimizes communication, resulting in reduced latency and improved energy efficiency.
For 2.5D chiplet systems, a novel deadlock-free and fault-tolerant routing algorithm is presented. The fault-tolerant algorithm enhances redundancy in vertical-link selection and offers improved network reachability with reduced latency compared to existing solutions, even in the presence of faults. Furthermore, to address the energy consumption of silicon-photonic-based 2.5D networks, a reconfigurable, power-efficient, and congestion-aware silicon-photonic 2.5D interposer network is proposed. The proposed photonic interposer utilizes phase-change materials (PCMs) for dynamic reconfiguration and power gating of the photonic network, leading to lower latency and improved energy efficiency. Additionally, the research investigates the integration of optical computation and communication into 2.5D chiplet platforms for domain-specific machine learning (ML) processing. This approach aims to overcome the limitations in computation density and communication speed faced by traditional accelerators, paving the way for sustainable and scalable ML hardware. Furthermore, this dissertation proposes a 2.5D chiplet-based architecture utilizing a silicon-photonic interposer, which tackles the limitations of conventional bus-based communication by employing a novel switch-based network, achieving significant energy-efficiency improvements for high-bandwidth, low-latency data movement in machine learning accelerators. The switch-based network employs our proposed optical switch, based on Mach–Zehnder Interferometer (MZI) devices with a dividing state, to facilitate broadcast and optimize communication for ML workloads. Finally, the dissertation explores security considerations in 2.5D chiplet systems with diverse, potentially untrusted chiplets. To address this, a secure routing framework for the Network-on-Interposer (NoI) is presented.
The proposed secure framework protects the system against distributed denial-of-service (DDoS) attacks by concealing predictable routing paths, and it leverages multi-objective optimization to balance efficiency and reliability for the NoI. The contributions in this dissertation advance the field of chip-scale interconnection networks by proposing novel techniques for improved performance, reliability, and power efficiency in 3D and 2.5D NoC architectures. These advancements hold promise for the design of future high-performance computing systems, particularly in machine learning and other computationally intensive applications.

ACKNOWLEDGEMENTS

I would like to express my deepest gratitude to my advisor, Prof. Mahdi Nikdast, for his invaluable guidance, unwavering support, and endless patience throughout the journey of this dissertation. His expertise, encouragement, and constructive feedback have been instrumental in shaping this work. Whenever deadlines loomed, he devoted his time to ensure our progress. Beyond the technical realm, Prof. Nikdast extended his support to me during personal hardships, such as the challenging period of pandemic-induced separation from my wife. His unwavering commitment began with my first email to him from Iran, my home country, and continues to this day, five years later. His dedication to our growth and success is truly commendable.

I am also grateful to my co-advisor, Prof. Sudeep Pasricha, for his insightful input, mentorship, and dedication to helping me navigate the complexities of my research. His perspective and encouragement have been immensely valuable in refining my ideas and methodologies. Prof. Pasricha generously devoted his time to assist me in every step of this project, and I consider myself fortunate to have such a knowledgeable and supportive mentor.

I extend my heartfelt thanks to the members of my committee, Prof. Anura P. Jayasumana and Prof. Yashwant K.
Malaiya, for their time, expertise, and constructive criticism. Their diverse perspectives and thoughtful insights have greatly enriched the quality of this dissertation.

Special gratitude goes to my mentors and collaborators outside Colorado State University. Prof. Ahmad Patooghy, my master's degree advisor, shaped my mindset and encouraged me to pursue a Ph.D. I also want to express my appreciation to Prof. Ryan Kim, who provided significant assistance and collaboration during the first two years of my Ph.D. project. Additionally, I am thankful to Prof. Nader Sehatbakhsh and his student, Pooya Aghanoury, at UCLA for their collaboration, as well as Prof. Kaveh Rahbardar Mojaver for their assistance in my Ph.D. project.

To my collaborators and labmates, including Amin Mahdian, Kamil Khan, Febin Sunny, Alex Qi, and Asif Mirza, I express my gratitude for your camaraderie, support, and intellectual exchange. I have learned a great deal from each of you and am thankful for the ideas we have built together. Additionally, I am grateful to my colleagues at the EPIC lab and ECSyD lab. Your friendship and collaboration have enriched this journey, making it both enjoyable and rewarding.

I owe a debt of gratitude to my wife, Mydeh, for her unwavering love, understanding, and patience. Her encouragement, support, and sacrifices have been the bedrock of my success, and I am profoundly grateful for her presence in my life.

Last but not least, I want to thank my parents for their unconditional love, support, and belief in me. Their encouragement and sacrifices have been a source of strength and motivation throughout this endeavor.

This dissertation would not have been possible without the support and encouragement of all those mentioned above, as well as many others who have contributed in ways both seen and unseen. Thank you all for being part of this transformative journey.

LIST OF RESEARCH PUBLICATIONS

1. E. Taheri, M. A. Mahdian, S. Pasricha, and M.
Nikdast, "SwInt: A Non-Blocking Switch-Based Silicon Photonic Interposer Network for 2.5D Machine Learning Accelerators," IEEE Journal on Emerging and Selected Topics in Circuits and Systems.
2. E. Taheri, S. Pasricha, and M. Nikdast, "ReD: A Reliable and Deadlock-Free Routing Algorithm for 2.5D Chiplet Networks," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
3. E. Taheri, P. Aghanoury, S. Pasricha, M. Nikdast, and N. Sehatbakhsh, "SCRIPT: A Multi-Objective Routing Framework for Securing Chiplet Systems against Distributed DoS Attacks," Proceedings of the Great Lakes Symposium on VLSI.
4. M. A. Mahdian, E. Taheri, K. Rahbardar Mojaver, and M. Nikdast, "Photonic Physically Unclonable Functions using Ring-Assisted Contra-Directional Couplers," IEEE/Optica Optical Fiber Communication (OFC) Conference, 2024.
5. F. Sunny, E. Taheri, M. Nikdast, and S. Pasricha, "Silicon Photonic 2.5D Interposer Networks for Overcoming Communication Bottlenecks in Scale-out Machine Learning Hardware Accelerators," IEEE VLSI Test Symposium, 2024.
6. E. Taheri, M. A. Mahdian, S. Pasricha, and M. Nikdast, "TRINE: A Tree-Based Silicon Photonic Interposer Network for Energy-Efficient 2.5D Machine Learning Acceleration," Proceedings of the 16th International Workshop on Network on Chip, 2023.
7. M. A. Mahdian, E. Taheri, and M. Nikdast, "Pars: A Power-Aware and Reliable Control Plane for Silicon Photonic Switch Fabrics," International Conference on Photonics in Switching and Computing (PSC), 2023.
8. E. Taheri, R. G. Kim, and M. Nikdast, "AdEle+: An Adaptive Congestion-and-Energy-Aware Elevator Selection for Partially Connected 3D Networks-on-Chip," IEEE Transactions on Computers, 2023.
9. F. Sunny, E. Taheri, M. Nikdast, and S. Pasricha, "Machine Learning Accelerators in 2.5D Chiplet Platforms with Silicon Photonics," IEEE/ACM Design, Automation and Test in Europe (DATE) Conference and Exhibition, 2023.
10. E. Taheri, S. Pasricha, and M.
Nikdast, "ReSiPI: A Reconfigurable Silicon-Photonic 2.5D Chiplet Network with PCMs for Energy-Efficient Interposer Communication," IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2022.
11. E. Taheri, S. Pasricha, and M. Nikdast, "DeFT: A Deadlock-Free and Fault-Tolerant Routing Algorithm for 2.5D Chiplet Systems," IEEE/ACM Design, Automation and Test in Europe (DATE) Conference and Exhibition, 2022.
12. F. Sunny, E. Taheri, M. Nikdast, and S. Pasricha, "A Survey on Silicon Photonics for Deep Learning," ACM Journal on Emerging Technologies in Computing Systems (JETC), 2021.
13. E. Taheri, R. G. Kim, and M. Nikdast, "AdEle: An Adaptive Congestion-and-Energy-Aware Elevator Selection for Partially Connected 3D NoCs," IEEE/ACM Design Automation Conference (DAC), 2021.
14. A. Mirza, S. Manafi Avari, E. Taheri, S. Pasricha, and M. Nikdast, "Opportunities for Cross-Layer Design in High-Performance Computing Systems with Integrated Silicon Photonic Networks," IEEE/ACM Design, Automation and Test in Europe (DATE) Conference and Exhibition, 2020.

TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGEMENTS
LIST OF RESEARCH PUBLICATIONS
LIST OF TABLES
LIST OF FIGURES

Chapter 1  Introduction
  1.1  Multicore and system-on-chip
  1.2  On-chip communication
  1.3  Networks-on-Chip
    1.3.1  Topology
    1.3.2  Router architecture
    1.3.3  Switching
    1.3.4  Routing
    1.3.5  Deadlock recovery
    1.3.6  Deadlock avoidance
    1.3.7  3D Networks-on-Chip
    1.3.8  2.5D Networks-on-Chip (chiplet systems)
  1.4  Silicon Photonic Networks-on-Chip
  1.5  NoC design challenges in 2.5D and 3D chip technologies
    1.5.1  Network latency and traffic distribution
    1.5.2  Fault tolerance
    1.5.3  Hardware security
    1.5.4  Energy efficiency
    1.5.5  Scalability
  1.6  Dissertation motivation
  1.7  Dissertation outline

Chapter 2  Adaptive Congestion-and-Energy-Aware Elevator Selection for Partially Connected 3D NoCs
  2.1  Introduction
  2.2  Background and Related Work
    2.2.1  Partially Connected 3D Networks-on-Chip
    2.2.2  Elevator Placement and Routing Algorithms
    2.2.3  Elevator Selection Algorithms
  2.3  Proposed Elevator-Selection Algorithm: AdEle+
    2.3.1  Motivation: Routing in PC-3DNoCs
    2.3.2  Offline Optimization in AdEle+
    2.3.3  Adaptive Online Elevator Selection
    2.3.4  Elevator Selection in AdEle+ under Low Traffic
    2.3.5  Adaptive Selection Between Distance-Based or Enhanced Round-Robin
  2.4  Experimental Results and Evaluations
    2.4.1  Simulation Setup
    2.4.2  Parameter Exploration and Optimization
    2.4.3  Evaluation Results
  2.5  Conclusion

Chapter 3  A Reliable and Deadlock-Free Routing Technique for 2.5D Chiplet-based Interposer Networks
  3.1  Introduction
  3.2  Background and Related Work
    3.2.1  Active interposers and their challenges
    3.2.2  Deadlock-Free Routing in Active Interposers
    3.2.3  Fault-Tolerant Routing in Conventional 2D Networks
    3.2.4  Fault-Tolerant Routing in 2.5D Chiplet Systems
  3.3  ReD: Proposed Routing Algorithm
    3.3.1  Proposed VN Separation and Deadlock-Freedom
    3.3.2  Proposed Fault-Tolerant Congestion-Aware VL Selection
    3.3.3  Fault-Tolerant Routing to Handle Horizontal Link Faults
    3.3.4  Sharing Fault Updates
    3.3.5  Deadlock- and Livelock-Freedom
  3.4  Evaluation and Simulation Results
    3.4.1  Simulation Environment and Configuration
    3.4.2  Latency Analysis
    3.4.3  Fault-Tolerance Analysis
    3.4.4  Hardware and Power Analysis
    3.4.5  Energy Analysis
  3.5  Conclusion

Chapter 4  A Reconfigurable Silicon-Photonic 2.5D Chiplet Network with PCMs for Energy-Efficient Interposer Communication
  4.1  Introduction
  4.2  Background and Related Work
    4.2.1  Chiplet systems and electronic interposers
    4.2.2  Silicon photonic interposers
    4.2.3  PCM-based silicon photonic devices
  4.3  ReSiPI: Overview
    4.3.1  Dynamic gateway management
    4.3.2  ReSiPI interposer network architecture
    4.3.3  Adaptive active gateway selection
    4.3.4  Per-packet gateway selection
    4.3.5  Reconfiguration controller architecture
  4.4  Simulation Results and Analysis
    4.4.1  Simulation setup
    4.4.2  Design-space exploration: Optimal Lm
    4.4.3  ReSiPI controller overhead
    4.4.4  Latency, power, and energy analysis
    4.4.5  Adaptivity analysis
    4.4.6  Bandwidth distribution analysis
  4.5  Conclusion

Chapter 5  Machine Learning Accelerators in 2.5D Chiplet Platforms with ReSiPI-based Silicon Photonic Interposer
  5.1  Introduction
  5.2  Case study: Silicon photonic 2.5D DNN accelerator
  5.3  Experimental results
  5.4  Conclusion and open challenges

Chapter 6  A Non-Blocking Silicon Photonic Switch-Based Interposer Network for 2.5D Machine Learning Accelerators
  6.1  Introduction
  6.2  Background and Related Work
    6.2.1  Silicon Photonic Communication
    6.2.2  Silicon Photonic Interposers
  6.3  Proposed Interposer Network: SwInt
    6.3.1  Architecture of the 2.5D Accelerator in SwInt
    6.3.2  Motivation for Switch-based Network Architectures
    6.3.3  SwInt Switch Topology
    6.3.4  Routing Strategies and Optimization in SwInt
    6.3.5  Multicast and Broadcast Capabilities in SwInt
  6.4  Design and evaluation of SwInt's devices
    6.4.1  MZS Design and Optimization
    6.4.2  Micro Ring Resonator Design and Optimization
  6.5  SwInt's Architecture Evaluation
    6.5.1  Simulation Setup
    6.5.2  Simulation Results
  6.6  SwInt with Bus Integration
    6.6.1  Switch-based and bus-based architectures
    6.6.2  SwInt with Bus Integration: Design Trade-offs
  6.7  Conclusion

Chapter 7  SCRIPT: A Multi-Objective Routing Framework for Securing Chiplet Systems against Distributed DoS Attacks
  7.1  Introduction
  7.2  Background and Related Work
  7.3  Threat Model and Assumptions
  7.4  SCRIPT: Proposed Performance- and Security-aware Routing Framework
    7.4.1  Overview
    7.4.2  SCRIPT Routing Process
    7.4.3  Enhancing Security in VLs
    7.4.4  Multi-Objective Optimization
    7.4.5  Obfuscated Deadlock-Free Routing
  7.5  Evaluation and simulation results
    7.5.1  Simulation Setup
    7.5.2  Multi-Objective Optimization Results
    7.5.3  Security and Latency Analysis
    7.5.4  Area and Power Analysis
  7.6  Conclusion

Chapter 8  Summary and future work
  8.1  Summary
  8.2  Future Work
    8.2.1  Integration of AdEle+ and ReD
    8.2.2  Optimize ReSiPI Architectures to Address Non-Ideal Frequency Response of PCMs
    8.2.3  On-chip Network Monitoring and Machine Learning
    8.2.4  Co-design of Communication Architectures and Machine Learning Workloads
    8.2.5  Security Enhancements in SCRIPT
    8.2.6  Scaling SwInt for Large-scale Systems
    8.2.7  Efficient electronic controllers

Bibliography

LIST OF TABLES

2.1  Simulation setup of AdEle+
2.2  Performance of selected solutions from Fig. 2.6
2.3  trDB for different elevator placements
2.4  Area overheads for the different elevator selection algorithms
3.1  Abbreviations used in ReD's chapter
3.2  Simulation setup of ReD
3.3  Area and power analysis of ReD, MTR, and RC
4.1  Simulation setup of ReSiPI
4.2  Overhead analysis of ReSiPI's controller (see Fig. 4.7)
6.1  Simulation setup of SwInt
7.1  Area and power: SCRIPT vs. DeFT and MTR

LIST OF FIGURES

1.1  On-chip communication paradigms: (a) in point-to-point communication, each core is directly connected to every other core through a dedicated physical link; (b) in bus-based communication, the bus is shared among the cores; and (c) a Network-on-Chip can handle simultaneous communication between cores.
1.2  (a) A Network-on-Chip with four cores, and (b) the corresponding topology of this network.
1.3  Network-on-Chip topology examples: (a) mesh, (b) torus, and (c) butterfly.
1.4  NoC router architecture.
1.5  Routing example: (a) XY routing, and (b) the turn model of XY routing.
1.6  Deadlock-free turn models: (a) west-first, (b) north-last, and (c) negative-first.
1.7  Deadlock turn-model examples: (a) the turns in the right-hand cycle create a cyclic dependency, (b) the combination of the turns in both cycles can create a cyclic dependency, and (c) an example of a combination of the avoided turns in (b) resulting in a deadlock cycle.
1.8  Subnetworks to achieve deadlock freedom while offering fully adaptive routing: (a) adaptive routing from south to north (e.g., S1 to D1), and (b) adaptive routing from north to south (e.g., S2 to D2).
12 1.9 An example of a 2.5D chiplet system with four chiplets connected through an inter- poser. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 1.10 Data transmission on a silicon photonic interposer. . . . . . . . . . . . . . . . . . . . 15 2.1 (a) An example PC-3DNoC with three elevators (e1, e2, and e3). The routing path from S to D based on Elevator-First algorithm [1] (dotted-red line) and the minimal path (blue-solid line) are shown. The middle-layer routers are colored based on their Elevator-First selected elevator. (b) Traffic load on each router in the middle layer: the e2 elevator is highly congested because of the inefficient elevator selection in Elevator- First algorithm (7 out of 16 routers use this elevator router). . . . . . . . . . . . . . . . 25 2.2 An overview of our proposed elevator-selection algorithm: AdEle+. . . . . . . . . . . 27 2.3 (a) Closest elevator selection example shows that 11 out of 36 destination routers in the bottom layer use non-minimal paths from source S (red destinations receive packets non-minimally). (b) Distance-Based (DB) elevator selection in AdEle+ shows exam- ple quadrants (RNW , RNE , RSW , RSE) based on S. Each elevator is color-coded based on their quadrant (with some elevators with two colors), and the regional clos- est elevator (RCE) for each region is marked with a color-coded star. Similarly, the closest elevator (CE) is marked with a black star (e5). Our DB elevator selection will consider the paths through the RCE in the destination quadrant and the overall closest elevator. Our DB elevator selection only routes 2 out of 36 destinations non-minimally from source S. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 xiii 2.4 Comparison of different elevator selection policies in terms of average distance under different network sizes and number of elevators. 
For each network size and number of elevators, 100 random elevator-placement patterns are evaluated. Here, we report both the average (average case) and the maximum (worst case) of all the 100 random elevator patterns’ average distance. (a) 4× 4 layer size - average case; (b) 4× 4 layer size - worst case; (c) 8× 8 layer size - average case; and (d) 8× 8 layer size - worst case. 36 2.5 Proposed framework to find trDB. ERR and distance-based elevator selections are simulated under different injection loads. The average cost (2.8) of the load where latency of both ERR and distance-based selections are equal is used as the minimal threshold (trDB) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 2.6 Elevator-subset solutions found by AMOSA in AdEle+. . . . . . . . . . . . . . . . . . 41 2.7 Impact of elevator set size on (a and c) uniform traffic and (b and d) shuffle traffic. . . . 42 2.8 Comparison of minimal versus ERR elevator selection under Uniform traffic: (a) av- erage latency and (b) energy per flit. This is used to find trDB. . . . . . . . . . . . . . 43 2.9 Comparison of ERR selection under Uniform traffic with different values of trDB for PL in (a) average latency and (b) energy per flit. . . . . . . . . . . . . . . . . . . . . . 44 2.10 Average latency for Elevator-First, CDA, AdEle, AdEle_RR, and AdEle+ under uni- form traffic and using different elevator placements. . . . . . . . . . . . . . . . . . . . 45 2.11 Average latency for Elevator-First, CDA, AdEle, AdEle_RR, and AdEle+ under shuf- fle traffic and using different elevator placements. . . . . . . . . . . . . . . . . . . . . 
46 2.12 Comparison of elevator traffic distribution for Elevator-First (ElevFirst), CDA, and AdEle+ in terms of (a) traffic load over elevators (blue, green, and red) normalized to the average load over horizontal links (white bars) of Elevator-First; (b) average flit residency over elevators; and (c) average flit residency of all elevators normalized to average flit residency of horizontal links. AdEle+_DB shows DB selection of AdEle+. 46 2.13 Normalized energy per flit for Elevator-First, CDA, AdEle, AdEle_RR, and AdEle+ under uniform traffic with different injection rates. . . . . . . . . . . . . . . . . . . . . 47 2.14 Latency for Elevator-First (ElevFirst), CDA, AdEle and AdEle+ normalized to Elev- First under real-application traffic with different elevator placements. . . . . . . . . . . 49 2.15 Energy for Elevator-First (ElevFirst), CDA, AdEle, and AdEle+ normalized to Elev- First under real-application traffic with different elevator placements. . . . . . . . . . . 50 2.16 Packet size effect on latency and energy-per-flit. . . . . . . . . . . . . . . . . . . . . . 51 3.1 An abstract overview of the baseline 2.5D network with four chiplets on an active interposer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.2 ReD’s rules for VN utilization and deadlock-freedom. . . . . . . . . . . . . . . . . . . 62 3.3 Examples for a VL selection in a chiplet with 16 routers: (a) A fault-free distance- based selection (closest VL is selected) under uniform traffic, (b) A good selection under non-uniform traffic, and (c) A distance-based selection with a VL fault and under uniform traffic, suffering from imbalanced traffic load on VLs. Here, T and l denote the inter-chiplet traffic rate of routers and VLs, respectively. Also, a router’s color represents its selected VL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
63 3.4 Deadlock-free turn models used for intra-chiplet and -interposer routing in (a) the first VN and (b) the second VN. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 xiv 3.5 Minimal routing to bypass HL faults (A2E: adaptive-to-east VN, A2W: adaptive-to- west VN). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 3.6 Non-minimal routing to get around HL faults (A2E: adaptive-to-east VN, A2W: adaptive- to-west VN). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 3.7 Proposed Hamiltonian-based approach for sharing faulty/healthy updates of VLs: (a) Hamiltonian-based routing is used to distribute information through increasing and de- creasing IDs, routed in the first and second VNs respectively, (b) and (c) two Hamiltonian- based routings (green and black) are used for fault tolerance when sharing information, specifically designed to handle HL faults. . . . . . . . . . . . . . . . . . . . . . . . . 76 3.8 Spare VLs used for packets which failed to receive VL updates and are routed to a faulty VL: (a) Spare VL on chiplets and (b) Spare VL on the interposer. . . . . . . . . 76 3.9 Average latency comparison among ReD, MTR, and RC routing algorithms when ap- plied to 2.5D networks with four-chiplet system. The comparison is conducted under different synthetic traffic patterns: (a) Uniform, (b) Localized, and (c) Hotspot. . . . . 81 3.10 Average latency comparison under different system sizes and Uniform traffic. (a) 6 chiplets, (b) 8 chiplets, and (c) 12 chiplets. . . . . . . . . . . . . . . . . . . . . . . . . 81 3.11 VC utilization of ReD under Uniform, Localized, and Hotspot traffic patterns. . . . . . 82 3.12 Latency improvement under real-application traffic using (a) a single application, and (b) two applications simultaneously. . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 3.13 Reachability in the presence of VL faults in a system with four chiplets. 
Note that ReD-Wrst. and ReD-Avg. are the same (both shown by ReD). 85
3.14 Reachability in the presence of VL faults in a system with (a) four chiplets (total VLs=32), and (b) twelve chiplets (total VLs=96). Note that ReD-Wrst. and ReD-Avg. are the same (both shown by ReD). 87
3.15 Average latency in ReD with different VL-selection strategies, fault-tolerant approaches, and fault-injection rates for a four-chiplet system. (a) VL faults, (b) HL faults on chiplets, (c) HL faults on the interposer, and (d) uniformly combined faults on VLs, HL faults on chiplets, and HL faults on the interposer. 87
3.16 Average percentage of updated routers when sharing the status of faulty/healthy VLs for (a) chiplets and (b) the interposer. 88
3.17 Normalized energy comparison for a four-chiplet network with (a) no faults, (b) 12.5% faulty VLs, and (c) 25% faulty VLs. 89
4.1 Dynamic inter-chiplet bandwidth management: Design A with a larger number of wavelengths (e.g., four) and Design B, which is considered in ReSiPI, with a larger number of gateways and fewer (e.g., two) wavelengths. 96
4.2 An example of the proposed photonic interposer architecture (ReSiPI) with a total of six gateways (one per chiplet) and four optical wavelengths. This architecture can be extended to have multiple gateways per chiplet. 97
4.3 A PCM-based coupler (PCMC) in different states: (a) crystalline state to guide light to the Bar (B) output, (b) partially crystalline state to guide a portion of light to the Cross (C) output and the rest to the Bar output, and (c) amorphous state to guide the input light to the Cross output.
99
4.4 Number of active gateways based on load changes. 103
4.5 Dynamic gateway management in ReSiPI. 104
4.6 An example of the adaptive gateway selection in ReSiPI for different numbers of activated gateways. The dashed boxes show the routers that will use a specific gateway (G). 106
4.7 ReSiPI's reconfiguration controller architecture. 107
4.8 Average latency vs. optical link transmission rate. 109
4.9 (a) Normalized average latency, (b) Normalized average power, and (c) Normalized energy. 109
4.10 Adaptivity comparison between ReSiPI and PROWAVES: (a) average delay, (b) average power, (c) number of activated gateways in ReSiPI, and (d) number of activated wavelengths in PROWAVES. 110
4.11 Average residency of flits on the routers in the first chiplet in (a) PROWAVES and (b) ReSiPI. 113
5.1 Overview of proposed 2.5D interposer chiplet-based DNN accelerator architecture. 117
5.2 Example of optical communication on interposer: MACs are reading data from memory. 118
5.3 Silicon photonic network in our 2.5D chiplet-based DNN accelerator. Each MRG is connected to a gateway on a chiplet. 119
5.4 Performance analysis of CrossLight, 2.5D-CrossLight with electronic interposer, and 2.5D-CrossLight with silicon photonic interposer, (a) normalized power consumption, (b) normalized total latency, and (c) normalized energy-per-bit. 120
6.1 Chiplet system architecture considered in SwInt.
126
6.2 A bidirectional silicon photonic link with four wavelengths to provide optical communication between the GLB chiplet and a MAC chiplet. 127
6.3 (a) GLB to MAC chiplets communication via a bus-based approach, (b) MAC chiplets to GLB communication via a bus-based approach, (c) SwInt facilitating GLB to MAC chiplets communication using a switch-based interposer, and (d) SwInt enabling MAC chiplets to GLB communication. 127
6.4 (a) MRR's Through-port response and optical loss under different wavelengths. (b-c) Power loss of filters in the bus-based network compared to power loss of the Benes and Butterfly networks. 129
6.5 An example of (a-b) Butterfly topology with suboptimal input selection, (c) SwInt network for GLB to MAC chiplets communication, (d) SwInt network for MAC to GLB chiplets communication, and (e) Benes topology. Orange switches are removed compared to a typical Butterfly/Benes topology. Red switches suffer from blocking. 134
6.6 An example of a broadcast in SwInt, achieved using the Divide state in the MZI switch. 138
6.7 (a) An MZS including the MMI couplers and PN junction phase shifter. (b) An adiabatic MRR (wr2 > wr1), and (c) SEM image of our fabricated device. 141
6.8 (a) Field distribution of Cross-state, (b) Bar-state, and (c) Divide-state. 142
6.9 (a) The response of the fabricated MMI. (b) The power imbalance between the two outputs of the MMI. 142
6.10 Power imbalance effect of the "Divide" state on the scalability of broadcasting. 143
6.11 Power analysis: (a) Unicast state, and (b) Broadcast state. White portion in SPACX and SwInt shows power saving.
144
6.12 (a) Latency analysis, (b) energy analysis (the acronym represents models' names listed in Section 6.5.1), and (c) scalability analysis. The results are normalized to the average case (Avg) in SPRINT. 144
6.13 Analysis of blocking instances of Butterfly in comparison with SwInt. (a) 4 chiplets, (b) 8 chiplets, and (c) 12 chiplets. 145
6.14 SwInt with Bus, with 2 readers at each output (B = 2). 148
6.15 Exploring the optimal B (number of readers at each output of SwInt) in the SwInt with Bus design under different numbers of wavelengths. 148
7.1 High-level overview of a 2.5D network with 4 chiplets, subdivided into 16 IPs. Each chiplet contains four boundary routers interfacing with the interposer. Chiplets 1 & 2 are malicious and transmit DDoS flood packets to Chiplet 3. 152
7.2 SCRIPT framework: (a) Design-time optimization takes into account the trust level and criticality of chiplets, along with performance objectives, to propose a list of Vertical Links (VLs) for use in the runtime routing. (b) In routing, a VL is selected, and the routing mode (obfuscated or regular) is determined to deliver the packet to the specified VL. 153
7.3 Normalized average distance overhead with two different patterns of chiplet-connected VLs (P1 and P2). 159
7.4 Multi-objective optimization and solution selection process. (a) Pareto front under weighted distance and utilization variance objectives. Five solutions on the Pareto front are selected for evaluation. (b) Latency results of the simulation under the selected solutions.
166
7.5 (a) Normalized average latency under real application scenarios without attacks. X-axis shows the initial two letters of each application. (b) Maximum average flit residency in a router among all routers of the NoI under different attack scenarios (results are averaged among all applications). 167
7.6 Impact of an increased number of untrusted and attacker chiplets on (a) average latency, and (b) flit residency. 168

Chapter 1

Introduction

The relentless demand for high computational power has been a driving force in the semiconductor industry, leading to the development of innovative chip designs focused on scalability and performance enhancement. One notable solution that emerged from this pursuit is the integration of multiple cores onto a single chip, marking the era of multicore chips and systems-on-chip (SoCs), where numerous processing/memory cores are integrated onto the same chip. However, utilizing numerous processing cores in chips with traditional communication infrastructures poses significant challenges. NoCs have been introduced as a suitable communication platform for inter-core communication within chips. The possibility of fabricating three-dimensional integrated circuits, and consequently creating three-dimensional on-chip networks, has been a significant achievement in chip fabrication, particularly in addressing the challenges of traditional chips, especially in terms of latency. However, three-dimensional integration leads to a substantial increase in power density in chips. The high power density in these networks creates hot spots, posing a threat to the reliability of chips. Furthermore, chiplet systems, as an innovative technology, improve scalability and fabrication cost by disintegrating large-scale chips and connecting the resulting chiplets through a global interposer.
Both 3D and 2.5D integration pave the way for a high density of cores in the same package, targeting high computational power and energy efficiency. This is especially relevant given recent advances in machine learning applications, which demand massive parallelism and high computational power at minimal energy consumption to remain sustainable. However, providing efficient communication for such new technologies raises significant energy-efficiency, fault-tolerance, and security challenges. Therefore, designing and optimizing efficient NoCs for these innovative technologies is of great importance. Silicon photonic technology, in which on-chip communication is enabled by exchanging data using optical signals, is also a strong candidate for global interposer networks, offering high bandwidth and low latency over relatively long distances on the chip.

This chapter introduces multicore chips, SoCs, NoCs, and 3D and 2.5D integration. Moreover, it discusses how silicon photonics can enhance the energy efficiency and performance of 2.5D chiplet systems. It also briefly discusses energy efficiency, fault tolerance, and security in these networks, and outlines the objectives of this dissertation.

1.1 Multicore and system-on-chip

Multicore chips represent a paradigm shift in chip architecture, where several processing cores are integrated on the same chip to leverage parallel processing capabilities. For instance, the IBM Cyclops-64 chip includes 160 processing cores integrated on the same chip [2]. By incorporating multiple cores, multicore chips enable concurrent execution of tasks, thereby significantly boosting computational throughput and efficiency. The concept of multicore chips extends beyond homogeneous architectures to embrace heterogeneity, exemplified by the advent of System-on-Chip (SoC) designs.
In SoC implementations, diverse processing units such as central processing units (CPUs), graphics processing units (GPUs), caches, and specialized accelerators are integrated onto a single chip. This heterogeneous architecture facilitates efficient utilization of resources by allocating specific tasks to the most suitable processing element, thus optimizing performance for various workloads. The integration of heterogeneous cores within SoCs introduces a new era of computational efficiency and versatility. By leveraging the strengths of different processing units, SoCs empower applications to harness a broader spectrum of computational capabilities, ranging from general-purpose computing tasks handled by CPUs to parallelized data processing facilitated by GPUs and accelerators. Furthermore, the scalability inherent in multicore and heterogeneous chip designs aligns with the ever-growing demands of modern computing applications. As workloads become increasingly complex and diverse, the ability to seamlessly scale computational resources through multicore and heterogeneous architectures becomes paramount in meeting performance requirements while managing power consumption and cost constraints. However, assuming a large number of cores on a single chip, an effective communication paradigm must be designed to achieve high performance.

Figure 1.1: On-chip communication paradigms: (a) in point-to-point communication, each core is directly connected to every other core using a physical link; (b) in bus-based communication, the bus is shared among cores; and (c) in a Network-on-Chip, simultaneous communication between cores can be handled.

1.2 On-chip communication

Point-to-point communication, where there exists a dedicated physical connection between two cores, is not scalable.
Assuming N cores on a chip, providing communication between all cores would require N × (N − 1) direct connections (almost N^2 in a large-scale chip). Consequently, point-to-point communication proves impractical for current and future SoCs. The primary concern stems from limitations in the number of metal layers available in a chip's manufacturing technology.

To alleviate such limitations, traditional bus-based communication architectures have been introduced, where one or multiple buses are shared among all cores. However, bus-based communication suffers from scalability issues, both in terms of latency and performance. When a bus is allocated for communication between two cores, other cores sharing the same bus must wait until the communication is complete before proceeding with their own communication tasks. This waiting introduces latency into the communication process, and as the number of cores sharing the same bus increases, so does the latency imposed on the communication packets.

To address these scalability concerns in on-chip communication, Networks-on-Chip (NoCs) have been proposed. NoCs offer a more efficient and scalable communication paradigm compared to traditional point-to-point and bus-based architectures. These communication paradigms are compared in Figure 1.1.

Figure 1.2: (a) A Network-on-Chip with four cores, and (b) the corresponding topology of this network.

Figure 1.3: Network-on-Chip topology examples: (a) mesh, (b) torus, and (c) butterfly.

1.3 Networks-on-Chip

In comparison with conventional communication paradigms such as point-to-point and bus-based communication, NoCs provide efficient communication for on-chip systems because they offer high performance, low power consumption, and better scalability. NoCs provide communication for processing elements by employing routers that are connected using interconnects.
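To make the wiring argument above concrete, the following sketch (illustrative only; it assumes a k × k mesh as the NoC topology) compares the number of physical links each paradigm requires:

```python
def point_to_point_links(n):
    """Dedicated link between every ordered pair of cores: N * (N - 1)."""
    return n * (n - 1)

def mesh_links(k):
    """Bidirectional links in a k x k 2D mesh: k rows and k columns,
    each with (k - 1) links, giving 2 * k * (k - 1) in total."""
    return 2 * k * (k - 1)

for k in (4, 8, 16):
    n = k * k  # number of cores
    print(f"{n} cores: point-to-point needs {point_to_point_links(n)} links, "
          f"mesh needs {mesh_links(k)}")
```

For 64 cores, point-to-point wiring already demands 4032 dedicated connections, while an 8 × 8 mesh needs only 112 links, which is why NoCs scale where direct wiring cannot.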
NoCs have emerged as a prevalent solution to enable scalable on-chip communication in manycore systems [3]. Figure 1.2 illustrates a NoC architecture featuring four cores, along with the corresponding topology depicting the interconnection scheme between these cores.

1.3.1 Topology

NoC architectures provide versatile communication flexibility through a variety of topologies, which dictate how routers are interconnected. These topologies, exemplified by the mesh, torus, and butterfly configurations in Figure 1.3, significantly influence network performance based on specific application requirements. The mesh topology establishes a grid-like interconnection, enabling horizontal and vertical communication among cores. Mesh is favored for its simplicity, as its planar topology reduces fabrication costs and ensures uniform link lengths, facilitating low worst-case link delay for high-frequency operation. However, it suffers from a large diameter and average distance between nodes, particularly at the network's edges where connectivity is lower. To address this, the torus topology forms a toroidal structure, linking the outermost cores, thus enhancing communication efficiency and reducing latency. Nonetheless, torus configurations may incur longer link delays and increased complexity. Meanwhile, the butterfly topology organizes cores hierarchically, enabling parallel communication paths resembling a butterfly's wings. These diverse topologies present trade-offs in scalability, latency, and power consumption, allowing designers to customize NoC architectures to meet specific SoC design needs. Additionally, NoC topologies, categorized into standard and application-specific configurations, determine router and core interconnections. Standard topologies ensure connectivity between all routers, while optimized configurations match traffic patterns for improved performance and reduced costs.
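The diameter advantage of the torus over the mesh noted above can be quantified with a quick sketch (hop counts under minimal routing in a k × k network; a standard textbook property, not a result of this dissertation):

```python
def mesh_diameter(k):
    """Worst-case hop count in a k x k mesh: opposite corners are
    (k - 1) hops apart in each of the two dimensions."""
    return 2 * (k - 1)

def torus_diameter(k):
    """Worst-case hop count in a k x k torus: wrap-around links cap the
    distance per dimension at floor(k / 2)."""
    return 2 * (k // 2)

for k in (4, 8, 16):
    print(f"{k}x{k}: mesh diameter = {mesh_diameter(k)}, "
          f"torus diameter = {torus_diameter(k)}")
```

For an 8 × 8 network the torus nearly halves the worst-case distance (8 hops vs. 14), at the cost of the longer wrap-around links mentioned above.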
Mesh and torus configurations are prevalent in standard setups, offering simplicity and efficiency, while specialized configurations cater to specific application requirements, enhancing performance at the expense of planar properties or complexity.

1.3.2 Router architecture

A router in an NoC, as shown in Figure 1.4, comprises the following main components:

• Buffers: These buffers are of the first-in-first-out (FIFO) type and are used to store packets passing through the router. Routers can have input buffers, output buffers, or both. For example, an input-buffered router stores incoming data first and then, after routing, switch allocation, and link assignment, sends the data to the output link. In the model depicted in Figure 1.4, both input and output channels have buffers.

Figure 1.4: NoC router architecture.

• Switch: This unit is responsible for connecting the input buffers or links of the router to the corresponding output buffers or links.

• Routing and Arbitration Unit: This unit implements the routing algorithm, selects the output link for an input packet, and sets the crossbar accordingly. If multiple packets simultaneously request the same output link, this unit arbitrates between them. If the link is busy, the input packet waits in the input buffer and, once the link becomes available again, is routed.

• Link Controller: The flow of packets in the physical channel between neighboring routers is managed by the link controller unit. A link controller is placed on both sides of a channel to transmit flow control units.

1.3.3 Switching

Data communication in an NoC involves dividing the data intended for transmission between cores into packets, which are further divided into smaller units called flits.
Each flit is further divided into even smaller units called phits, which determine the width of the physical channel and are typically considered equal to a flit. Switching can be categorized based on temporal behavior and packet storage methods [4]:

Circuit-switched Routing

This method involves reserving a path (circuit) from the source to the destination before data transmission. All packets of a message utilize the same path. A route-request is sent from the source to the destination to establish the circuit, allocating channels for packet transmission. The circuit channels remain dedicated to the source-destination pair until data transmission ends, after which the resources are released. However, the circuit-construction overhead and the difficulty of finding free circuits during network congestion are drawbacks.

Packet-based Routing

This method does not require route establishment before data transmission. Routing is performed separately for each packet, potentially leading to different routes for different packets of the same message. Packet-based routing includes three types:

• Store-and-Forward: Upon reaching a node, the entire packet is stored in the buffer, and routing decisions are made based on the decoded routing information. The packet is forwarded only after it has been completely received and the next router can accommodate the entire packet. This method ensures accurate routing but may lead to higher latency.

• Direct Virtual Cut-Through: Routing information is placed in the header of the first flit. Only the first flit needs to be received before a routing decision is made, enabling faster transmission compared to store-and-forward routing.

Figure 1.5: Routing example: (a) XY routing, and (b) the turn model of XY routing.

• Wormhole: Wormhole switching breaks packets into small flits; a head flit carries the routing information, and the remaining flits follow it through the network in a pipelined manner. Because routers buffer only a few flits rather than entire packets, wormhole switching reduces buffer usage compared to direct virtual cut-through.
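The packet-to-flit decomposition underlying these switching methods can be sketched as follows; the flit width and the HEAD/BODY/TAIL labels are illustrative assumptions, not the dissertation's format:

```python
def packetize(payload: bytes, flit_bytes: int = 4):
    """Split a packet payload into HEAD/BODY/TAIL flits (illustrative).

    The HEAD flit is the one that would carry the routing header; the TAIL
    flit is the one that releases channel resources in wormhole switching.
    """
    chunks = [payload[i:i + flit_bytes]
              for i in range(0, len(payload), flit_bytes)]
    flits = []
    for i, chunk in enumerate(chunks):
        if i == 0:
            kind = "HEAD"
        elif i == len(chunks) - 1:
            kind = "TAIL"
        else:
            kind = "BODY"
        flits.append((kind, chunk))
    return flits

# A 12-byte payload with 4-byte flits yields one HEAD, one BODY, one TAIL.
flits = packetize(b"0123456789AB")
```

Under store-and-forward, a router would buffer all of these flits before forwarding; under cut-through and wormhole, forwarding can start as soon as the HEAD flit has been routed.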
Wormhole switching is widely regarded as the most popular and efficient switching method in NoCs, and it will be the primary switching method discussed throughout the remainder of this dissertation. However, it is important to note that the likelihood of deadlocks, i.e., infinite cyclic dependencies in the network, increases significantly with this method, necessitating further discussion and the exploration of mitigation strategies in subsequent chapters.

1.3.4 Routing

Routing in NoC architectures dictates how packets traverse intermediate routers from a source router to a destination router. Figure 1.5 provides an illustration of an XY routing example, showcasing the routing path taken by data packets through a mesh topology. It also depicts the potential turns resulting from XY routing. For instance, when routing from S1 to D1, the west-north turn is taken, as shown in the figure. However, since YX routing is not implemented, the dashed-red turns are not taken, effectively breaking a cycle. This cycle-breaking mechanism is crucial for avoiding deadlocks, a topic that will be discussed in more detail later in this section.

Routing strategies are categorized based on several criteria. First, routing can be either distributed or centralized. In distributed routing, each router makes decisions locally based on packet information, while in centralized routing, decisions are made for the entire network from specific locations. Additionally, routing can be single-destination or multi-destination, allowing one sender to transmit data to multiple receivers. Furthermore, routing can be oblivious, with no knowledge of network conditions, or adaptive, adjusting routing paths based on network conditions. In adaptive routing, decision-making can occur locally at each node. Routers play a crucial role by determining how to forward inputs to outputs using specific routing algorithms.
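The XY (dimension-order) routing behavior illustrated above can be expressed compactly; the coordinate convention (x grows eastward, y grows northward) and the port names are assumptions for illustration:

```python
def xy_route(cur, dst):
    """Return the output port for one hop of XY routing on a 2D mesh.

    Packets first travel along X (east/west) and only then along Y
    (north/south); forbidding the Y-to-X turns breaks all routing cycles,
    which is what makes XY routing deadlock-free.
    """
    (cx, cy), (dx, dy) = cur, dst
    if cx != dx:                       # X dimension first
        return "EAST" if dx > cx else "WEST"
    if cy != dy:                       # then Y dimension
        return "NORTH" if dy > cy else "SOUTH"
    return "LOCAL"                     # arrived: eject to the local core
```

For example, a packet at (0, 0) heading to (2, 1) is sent EAST until its x-coordinate matches the destination, then NORTH, so a north-to-east (or any Y-to-X) turn can never occur.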
Each router utilizes the routing information contained in the packet header to route the packet to the appropriate output. Despite these mechanisms, routing in NoC architectures can encounter challenges leading to packet loss or delays. The three main concerns in routing are deadlock, livelock, and starvation. Deadlock occurs when packets wait for each other to proceed, resulting in cyclic dependencies and network saturation. Livelock arises when packets circulate without ever reaching their destination. Starvation occurs when packets with lower priorities never reach their destination because resources are always allocated to higher-priority packets. One of the most important aspects of designing a deadlock-free routing algorithm is the need for cycle breaking: cyclic dependencies between router buffers result in infinite waiting between channels and deteriorate network performance. Deadlock mitigation methods are broadly divided into two categories: turn-model-based and virtual-channel-based routing algorithms.

1.3.5 Deadlock recovery

Deadlock recovery in NoC architectures refers to the process of detecting and resolving deadlock situations in which multiple routers are blocked, unable to proceed due to circular dependencies in buffer allocation. While deadlock recovery mechanisms can help revive a system from such a situation, deadlocks can bring communication to a grinding halt, rendering the entire system non-functional. Therefore, it is imperative to prioritize the prevention of deadlock rather than relying solely on recovery strategies.

Figure 1.6: Deadlock-free turn models: (a) west-first turn model, (b) north-last turn model, and (c) negative-first turn model.

Deadlock recovery mechanisms, while useful as a last resort, come with inherent drawbacks.
They introduce complexity and overhead to the system, which can degrade performance and efficiency. Additionally, the time taken for deadlock recovery can lead to significant delays, particularly in time-sensitive applications, impacting system responsiveness and reliability.

Moreover, relying on recovery strategies alone does not address the root cause of deadlock. It merely resolves immediate deadlock situations without guaranteeing the prevention of future occurrences. This can result in a vicious cycle of deadlock detection and recovery, leading to system instability and inefficiency.

Therefore, the focus should be on implementing deadlock-avoidance strategies as a fundamental aspect of NoC design. By prioritizing deadlock avoidance, designers can create more robust and reliable systems that operate seamlessly without the need for frequent recovery interventions. This approach ensures the continuous and uninterrupted flow of data within the NoC, safeguarding system performance and functionality.

1.3.6 Deadlock avoidance

Turn Model

As discussed for XY routing, restricting turns results in deadlock-free routing, since the remaining turns cannot form a cycle. A turn-model algorithm prohibits one or more turns during packet routing, thereby cutting off cyclic dependencies between channels. Three well-known models follow this approach, as shown in Figure 1.6:

Figure 1.7: Deadlock turn model examples: (a) the turns in the right-hand cycle create a cyclic dependency, (b) the combination of the turns in both cycles can create a cyclic dependency, and (c) an example of a combination of the avoided turns in (b) resulting in a deadlock cycle.

• West-first Turn Model: In this turn model, all turns toward the west are prohibited. While this rule breaks any possible cyclic dependency, a packet that moves westward cannot be routed adaptively.
• North-last Turn Model: In this turn model, similarly, a packet that moves northward cannot be routed adaptively. In other words, after moving northward, it cannot turn anymore.

• Negative-first Turn Model: In this turn model, a packet that has moved in a positive direction cannot subsequently turn in a negative direction.

As can be seen, by avoiding only one of the four turns in each cycle, deadlock can be avoided. However, there are cases where avoiding one turn per cycle cannot break the cyclic dependency, as shown in Figure 1.7.

Virtual Channels

Flow control between two routers works such that packets are stored in the input and output buffers of each channel (link), and passage through the physical channel occurs only at certain clock cycles. Since the buffers are FIFO, the packet that enters first is prepared for transmission before the others; therefore, when a packet occupies the buffers of a channel, another packet cannot access the channel even when the physical channel itself is free. In this situation, by adding a buffer alongside the existing buffers of each channel, the physical channel can be shared among multiple virtual channels; compared to adding another physical channel, this saves hardware overhead at the cost of extra buffers.

Figure 1.8: Sub-networks to achieve deadlock freedom while offering fully adaptive routing: (a) adaptive routing from south to north (e.g., S1 to D1), and (b) adaptive routing from north to south (e.g., S2 to D2).

By adding virtual channels, the flow of each virtual channel is separated, and different turn restrictions can be assigned to each virtual channel. In this case, the turn model is examined separately for each virtual channel, and packets can be transferred from one channel to another as long as cyclic dependencies between the virtual channels are not created.
For example, with two virtual channels, a packet may be sent from the first channel to the second, provided that packets from the second channel are never sent back to the first. Therefore, virtual channels not only improve network performance but can also break cyclic dependencies and prevent deadlock, at the cost of the hardware overhead of the additional buffers. For instance, the two sub-networks presented in Figure 1.8 can prevent deadlock and offer fully adaptive routing if one virtual channel is assigned to each sub-network: by following the routing rules (avoided/allowed turns) in each sub-network, deadlock freedom is guaranteed.

1.3.7 3D Networks-on-Chip

Conventional two-dimensional NoCs (2D NoCs) suffer from low scalability in supporting low-latency communication among a large number of on-chip processing cores. To address this challenge, three-dimensional NoCs (3D NoCs) have been proposed to offer high-performance communication for modern computers [5]. With the advances in three-dimensional (3D) integration technologies, systems with stacked dies are interconnected using Through-Silicon Vias (TSVs), further improving NoC scalability, integration density, and system heterogeneity [6–8]. From a network perspective, 3D topologies reduce the average inter-node distance and save energy consumption [5]. However, connecting the layers of 3D chips necessitates the use of inefficient vertical link technologies. The most promising vertical link technology is the TSV, although it introduces high cost and low link reliability. Partially connected NoCs (PCNoCs) have been proposed to alleviate the high cost of vertical links by utilizing a smaller number of TSVs. In 3D NoCs, each vertical link (a.k.a. elevator) includes tens or even hundreds of TSV wires. Therefore, removing some elevators greatly reduces the fabrication cost of the NoC while imposing a small latency overhead.
However, as some links are eliminated, the network topology becomes irregular, which makes the routing process more complex. Elevators are shared among routers, and each router selects an elevator to route packets to another layer.

The challenges of 3D NoCs include high fabrication costs, power density issues, and limited cooling conductivity, leading to thermal hotspots and reliability concerns [9, 10]. To mitigate these challenges, 2.5D NoCs (a.k.a. chiplet systems) offer a compelling alternative. In a 2.5D NoC, chiplets are integrated on an interposer, enabling inter-chiplet communication while maintaining modularity. This approach not only addresses the drawbacks of 3D integration but also improves manufacturing yield by disintegrating large chips into smaller, interconnected chiplets. With the continuous demand for high compute performance and parallelism in data-intensive applications, 2.5D chiplet systems present a scalable solution, allowing for easier integration of pre-designed chiplets and speeding up time-to-market for electronics designs.

1.3.8 2.5D Networks-on-Chip (chiplet systems)

The continuous expansion of data- and compute-intensive applications, such as big data analytics and deep learning, has spurred the development of large-scale chips with heightened compute performance and extensive parallelism. These chips, featuring numerous processing cores ranging from several tens to hundreds, provide the computational prowess necessary for emerging applications but are hindered by low manufacturing yields owing to their substantial chip size.

Figure 1.9: An example of a 2.5D chiplet system with four chiplets connected through an interposer.

In response
to this challenge, 2.5D chiplet systems have emerged, where a large chip is fragmented into several smaller chiplets interconnected via an inter-chiplet interposer network, substantially enhancing the collective manufacturing yield. Illustrated in Figure 1.9, a typical 2.5D chiplet system integrates multiple chiplets on an interposer, facilitating inter-chiplet communication. Each chiplet typically incorporates multiple processing cores interconnected by an intra-chiplet mesh-based NoC. This modular approach allows for the assembly of chiplets into new systems, effectively addressing design challenges, particularly time-to-market considerations, in the electronics domain. Similar to 3D SoCs, such modular integration enables the heterogeneous and independent design of each chiplet. Furthermore, the design cycle for chiplet systems is notably shorter than for traditional monolithic designs, as off-the-shelf chiplets can be readily integrated on an interposer to create chiplet systems with varying computation capabilities tailored for different applications. Consequently, chiplet-based systems can harness diverse manufacturing processes and technologies for each chiplet, amplifying the overall flexibility and efficiency of the system architecture. Given the escalating demands for high bandwidth and low latency in modern computing tasks, particularly in data- and compute-intensive applications, the adoption of optical communication on the interposer emerges as a compelling solution to meet these stringent requirements effectively.
Figure 1.10: Data transmission on a silicon photonic interposer.

1.4 Silicon Photonic Networks-on-Chip

Electronic interconnects in many designs (e.g., designs with long interconnects) result in insufficient energy efficiency, high latency, and low bandwidth. In such conventional designs, energy-efficient operation is limited to distances smaller than 300 mm [11]. Moreover, effects such as the skin effect translate into higher losses, and high wiring density causes additional losses because of the resistance and capacitance of the wires. As a result, conventional electronic interconnects are not a promising solution for the future of NoCs [12]. Silicon photonic communication offers a paradigm shift in on-chip data transfer, particularly advantageous for scenarios demanding high bandwidth and low energy consumption. While conventional metallic-based NoCs suffice for intra-chiplet communication within small-scale designs, silicon photonic solutions excel in inter-chiplet communication, especially on platforms like interposers. Leveraging silicon photonic technology on an interposer enables high-bandwidth communication among chiplets over relatively long distances on the interposer itself. This capability is particularly pertinent in domains such as machine learning accelerators, where vast amounts of data need to be rapidly shuttled between different processing units. Our proposed silicon photonic interposers, detailed in subsequent chapters, aim to address these communication challenges efficiently. In contrast to metallic-based interconnects, silicon photonic communication offers several key advantages.
Firstly, it allows for significantly higher-bandwidth transmission over longer distances while consuming minimal power. This efficiency is particularly crucial in modern computing architectures, where the demand for data transfer rates continues to escalate. Furthermore, the compatibility of silicon photonic components with existing semiconductor fabrication processes facilitates seamless adoption without requiring substantial infrastructure changes. These factors collectively position silicon photonic communication as a compelling solution for future-generation computing systems, especially in scenarios necessitating high-speed, low-latency data exchange among disparate chiplets. Fig. 1.10 shows an example of data transmission between two chiplets placed on a silicon-photonic interposer. On the interposer, modulators, filters, and photodiodes (PDs) perform electro-optical and opto-electrical data conversions. In this dissertation, we consider microring resonator (MR) devices [13, 14] for the modulators and filters, due to their area and power efficiency. We define a gateway as an electronic circuit on a chiplet, which controls the modulators (i.e., writer gateway in Fig. 1.10) and PDs (i.e., reader gateway in Fig. 1.10) on the interposer. Moreover, gateways receive/send data from/to the routers on the same chiplet. As shown in Fig. 1.10, optical signals with different wavelengths are generated by an off-chip laser source (green, red, and blue wavelengths). The optical signals are then coupled to the waveguide on the photonic interposer using an optical fiber and a grating coupler. At the writer gateway, MRs modulate electronic data onto the optical signals and, at the reader gateway, each MR filters its corresponding optical signal to be detected by a PD. Note that each MR device is designed to resonate at (i.e., couple with) a specific wavelength.
As a result, several wavelengths can transmit bits of data at the same time over the same waveguide; this technique is called wavelength-division multiplexing (WDM).

1.5 NoC design challenges in 2.5D and 3D chip technologies

1.5.1 Network latency and traffic distribution

Network latency is one of the primary design challenges for NoCs. As the number of cores and processing elements in a chip increases, so does the amount of data traffic between them. This traffic must be managed efficiently to prevent delays and ensure that each core can access the data it needs in a timely manner.

1.5.2 Fault tolerance

Faults in NoC architectures are categorized into three main types based on their duration and impact: permanent faults, transient faults, and periodic faults. Permanent faults persist in the system until repaired, while transient faults occur under specific conditions and last for a limited period. Periodic faults, such as thermal faults, recur at regular intervals and may render the affected component temporarily unusable until rectified. These faults can affect the performance and reliability of NoC systems, which are susceptible to various failure mechanisms influenced by environmental conditions and process variations. Radiation is a significant source of faults, originating from cosmic neutrons and alpha particles, which can induce Single Event Upsets (SEUs) or Single Event Transients (SETs) in memory elements or logic gates. Electrostatic discharge can break electrical bonds in electronic components, leading to dielectric, PN junction, or interconnect breakdowns, exacerbated by increased heat generation and leakage current. Aging effects, accelerated by heat, can degrade CMOS hardware performance and increase failure rates, particularly in interconnects. Electromigration, driven by high current density, causes thinning of metal contacts over time, leading to timing failures or circuit malfunction.
Heat-induced faults, especially prevalent in submicron technologies and three-dimensional on-chip networks, result from temperature-induced variations in transistor characteristics, affecting speed, power consumption, and reliability. To cope with faults, three main methods are employed: fault avoidance, fault hiding, and fault tolerance. Fault avoidance aims to prevent fault progression through testing and quality-control methods. Fault hiding reduces the severity or number of fault effects, often by incorporating redundancy in the system. Fault tolerance enables the system to continue operating despite faults, achieved through fault masking, detection and relocation, or routing algorithms that minimize exposure to faults. These coping methods are essential for ensuring the reliable performance of NoC architectures in the face of potential faults and failures. Fault-tolerant routing algorithms ensure network reliability in the face of faults, selecting alternative paths to avoid disruptions caused by faults. These algorithms employ techniques such as fault regions and turn models to mitigate the impact of faults on network performance and reliability.

1.5.3 Hardware security

Hardware security focuses on protecting computing systems from physical attacks and vulnerabilities in the underlying hardware components. In this dissertation, we specifically focus on the security of interposers in 2.5D chiplet systems. This is crucial because the interposer is the central communication hub for all chiplets and, if compromised, could bring the entire system down. Security is especially important in chiplet systems compared to traditional monolithic chips for several reasons. First, chiplet systems often integrate components from various sources, some of which might be untrusted. These untrusted chiplets could potentially launch Denial-of-Service (DoS) attacks by flooding the interposer with traffic, making it unavailable for legitimate communication.
Second, chiplet systems rely on vertical links for communication between chiplets stacked on top of the interposer. These vertical links are expensive and critical for system performance, making them prime targets for attackers.

1.5.4 Energy efficiency

Energy efficiency is another critical concern for chip designers, especially for NoCs. As the number of cores on a chip increases, so does the power required to transmit data between them. NoC designers must carefully consider energy consumption when designing communication architectures to ensure optimal performance without excessive power demands. Energy efficiency is paramount in 2.5D chiplet systems. These systems integrate multiple chiplets to overcome the limitations of large-scale chip fabrication. While this approach offers benefits, it also leads to a higher traffic load on the interposer network. To handle this increased traffic, solutions like silicon photonic communication have been proposed for high-bandwidth inter-chiplet communication. However, silicon photonics, despite its advantages, suffers from high power consumption in interposer networks. Therefore, designing energy-efficient 2.5D chiplet systems is critical to minimize power usage and ensure system reliability, even with the increased traffic demands.

1.5.5 Scalability

Scalability is another crucial aspect of NoC design. As the number of cores on a chip, and potentially the number of chiplets in a system, continues to increase, NoCs must be able to adapt and handle this growth. Designers employ various techniques to achieve scalability. One approach is using hierarchical topologies, where smaller networks are combined to form larger ones, such as integrating 2.5D and 3D networks. However, dynamic routing algorithms and reconfigurable architectures that can adjust to changes in the network's configuration must also be designed.
Additionally, NoC virtualization allows combining multiple NoCs into a single virtual network, enabling scaling across multiple chips or even entire systems.

1.6 Dissertation motivation

The motivation for this dissertation stems from the urgent need to tackle key challenges in NoC architectures, specifically focusing on 3D and 2.5D topologies and interposer designs, while addressing latency, energy efficiency, fault tolerance, and security concerns. This research endeavors to overcome the limitations of 3D NoCs, including high fabrication costs and reliability issues related to TSV technology, by introducing an adaptive congestion- and energy-aware elevator-selection algorithm to mitigate latency and enhance energy efficiency. Furthermore, the dissertation aims to enhance reliability and reduce latency in 2.5D chiplet systems through the development of fault-tolerant routing algorithms. The increasing significance of heterogeneous 2.5D integration, facilitating seamless chiplet integration to reduce design time and costs, underscores the need for ensuring the security and reliability of the Network-on-Interposer (NoI). Addressing this imperative, the dissertation also seeks to develop a secure routing algorithm to safeguard chiplet systems against Distributed Denial-of-Service (DDoS) attacks, an aspect yet to be adequately addressed in current research. Additionally, the dissertation aims to contribute novel solutions to further enhance the efficiency of chiplet systems, considering silicon photonic NoIs. The proposed photonic NoI with PCMs for energy-efficient interposer communication aims to mitigate power consumption in interposer-based photonic networks, providing lower latency, reduced power consumption, and enhanced energy minimization compared to existing photonic networks.
Furthermore, 2.5D accelerators require efficient interposer networks for low-latency and energy-efficient communication, facilitating parallel Multiply and Accumulate (MAC) operations.

1.7 Dissertation outline

The dissertation explores various aspects of interconnect architectures for emerging 3D and 2.5D topologies. Chapter 2 presents an adaptive congestion-and-energy-aware elevator-selection framework for Partially Connected 3D NoCs (PC-3DNoCs), proposing an algorithm for efficient elevator selection. Chapter 3 delves into a reliable and deadlock-free routing approach for 2.5D chiplet-based interposer networks, introducing a routing algorithm to ensure reliable communication in active interposers. In Chapter 4, a reconfigurable silicon-photonic 2.5D chiplet network with Phase Change Materials (PCMs) is outlined, emphasizing energy-efficient communication through the ReSiPI architecture. Chapter 5 extends the discussion to machine learning accelerators in 2.5D chiplet platforms, leveraging ReSiPI-based silicon photonic interposers for improved performance. Chapter 6 introduces a non-blocking switch-based silicon photonic interposer network for 2.5D machine learning accelerators, showcasing the SwInt architecture for efficient data transfer. Chapter 7 introduces a multi-objective routing framework designed to secure chiplet systems against Distributed Denial-of-Service (DDoS) attacks. Through these chapters, the dissertation aims to provide comprehensive insights into advanced interconnect technologies and their applications in next-generation chiplet-based systems while outlining avenues for future research. Finally, Chapter 8 summarizes this dissertation and provides some suggestions for future work.

Chapter 2

Adaptive Congestion-and-Energy-Aware Elevator Selection for Partially Connected 3D NoCs

Unlike conventional 2D Networks-on-Chip (NoCs), 3D NoCs offer scalable and energy-efficient on-chip communication.
Vertical die stacking in 3D NoCs is enabled using inter-layer Through-Silicon Via (TSV) links. However, TSV technology suffers from low reliability and high fabrication costs. To mitigate these costs, Partially Connected 3D NoCs (PC-3DNoCs), which use fewer vertical TSV links, have been introduced. However, with fewer vertical links (a.k.a. elevators), elevator-less routers have to send their inter-layer traffic to nearby elevators, increasing the traffic load and congestion at these elevators and potentially reducing performance. Therefore, it is important that elevator-less routers choose elevators that balance the traffic load among the available elevators. To address this problem, this chapter presents an adaptive congestion- and energy-aware elevator-selection algorithm, called AdEle+. AdEle+ employs an offline multi-objective simulated-annealing-based optimization to find good elevator subsets for routers. In addition, during high traffic loads, AdEle+ uses an adaptive, online elevator-selection algorithm to select an elevator from the elevator subset to dynamically manage traffic congestion on elevators. Moreover, in low-congestion circumstances, AdEle+ switches to a minimal-distance selection to improve energy efficiency. Compared to state-of-the-art selection algorithms under various PC-3DNoC configurations and traffic patterns, AdEle+ reduces the average latency by 9.5% on average (and up to 11.2%) while reducing the hardware overhead by 10.1% due to its efficient online selection in the routers.

2.1 Introduction

To create vertical links in TSV-based 3D NoCs, multiple TSVs are grouped into a bundle. However, due to the large TSV interconnect pitch and keep-out-zone requirements [15], these TSV bundles result in large area overhead. Moreover, TSVs are particularly susceptible to electromigration and capacitive crosstalk-induced issues [16, 17]. Therefore, it is very costly to have TSV-based vertical links at every router [6, 7].
To address these challenges, Partially Connected 3D NoCs (PC-3DNoCs) with TSV-based vertical links at only some routers have been proposed [1, 7, 18]. In addition to reduced fabrication costs, the PC-3DNoC paradigm can be utilized to handle missing vertical links due to TSV-based faults [19] and to improve manufacturing yields. Since PC-3DNoCs remove some of the vertical links (a.k.a. elevators), the remaining elevators must be shared among multiple routers. A well-known routing solution in PC-3DNoCs, Elevator-First routing [1], naïvely selects the nearest elevator without considering the overall network traffic, potentially creating traffic hotspots at certain elevators and increasing the network latency [7]. To mitigate the effects of these traffic hotspots, we can balance the traffic across all elevators using an adaptive routing technique that selects elevators based on elevator utilization. For example, Congestion-aware Dynamic elevator Assignment (CDA) [20] uses global traffic information to improve the elevator load distribution during runtime. Elevator-selection algorithms in PC-3DNoCs can be broadly classified into design-time and runtime approaches. For example, in [1, 21], each router is assigned one elevator for all vertical traffic at design time to optimize network performance (e.g., the network latency). On the other hand, online approaches like [22] monitor traffic and select elevators during runtime to help balance the elevator load. Since design-time approaches do most of the calculations offline, only simple look-up tables are required, oftentimes resulting in simpler and faster implementations compared to an online approach. However, these offline approaches [1, 21, 23–25] rely only on the traffic information available during design time and cannot adapt to new runtime traffic scenarios or changing system conditions, e.g., faults. Although online solutions can help adjust to these changes, they can impose large overhead.
For example, in CDA [20, 22], gathering global traffic information from distant routers and determining the selection at runtime impose significant hardware and latency overhead. This chapter addresses the elevator-selection problem in PC-3DNoC routing techniques by developing a novel congestion- and energy-aware adaptive elevator-selection scheme called AdEle+. The main contributions of AdEle+ are summarized in the following:

• We propose AdEle+, a two-stage elevator-selection approach that takes advantage of both design-time and runtime benefits: AdEle+ integrates a design-time elevator-set optimization and a runtime elevator-selection policy to balance traffic with minimal overhead.

• We utilize a Multi-Objective Simulated-Annealing-based optimization algorithm (AMOSA [26]) offline for the design-time elevator-set optimization. AMOSA chooses, for each router, an optimized subset of elevators that will be used during runtime selection to simplify the online elevator selection.

• We develop a low-cost runtime elevator selection that has two modes: one for high traffic loads using local traffic information and a simple modified round-robin technique to improve the elevator load; and another for low traffic loads using a distance-based elevator selection that considers the elevator distance to both source and destination.

• An algorithm is presented to find the optimized threshold for switching between the different runtime elevator-selection modes.

Our simulation results, under different synthetic and real-application traffic patterns with various PC-3DNoC configurations, show the promise of AdEle+ compared to the state-of-the-art.
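The runtime side of these contributions can be illustrated with a minimal sketch. All names, data structures, and the threshold value below are illustrative assumptions for exposition, not the dissertation's actual implementation: a locally observed load level decides between a round-robin mode that skips congested elevators and a distance-based (DB) mode that minimizes the detour through the elevator.

```python
# Minimal sketch of a two-mode online elevator selection in the spirit of
# AdEle+. Class name, threshold, and congestion signals are hypothetical.

def manhattan(a, b):
    """Hop distance between two (x, y) router coordinates."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

class OnlineElevatorSelector:
    def __init__(self, elevator_subset, congestion_threshold=0.5):
        # elevator_subset: (x, y) positions chosen offline (e.g., by AMOSA)
        self.subset = list(elevator_subset)
        self.threshold = congestion_threshold
        self.rr_index = 0  # round-robin pointer for the high-load mode

    def select(self, src, dst, local_load, elevator_busy):
        """Pick an elevator from the offline-optimized subset.

        local_load: locally observed traffic level in [0, 1]
        elevator_busy: dict mapping elevator position -> True if congested
        """
        if local_load > self.threshold:
            # High load: modified round-robin that skips congested elevators
            for _ in range(len(self.subset)):
                e = self.subset[self.rr_index]
                self.rr_index = (self.rr_index + 1) % len(self.subset)
                if not elevator_busy.get(e, False):
                    return e
            return self.subset[self.rr_index]  # all congested: fall back
        # Low load: distance-based (DB) mode, accounting for the elevator's
        # distance to both the source and the destination
        return min(self.subset,
                   key=lambda e: manhattan(src, e) + manhattan(e, dst))

sel = OnlineElevatorSelector([(0, 0), (3, 0), (1, 3)])
# Low load: the elevator closest to the minimal S-to-D path wins
print(sel.select(src=(1, 1), dst=(2, 2), local_load=0.1, elevator_busy={}))  # → (1, 3)
```

Under low load this degenerates to a minimal-detour choice, which is the energy-efficient behavior described above; under high load the round-robin pointer spreads inter-layer traffic over the subset using only local information.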
For instance, AdEle+ improves the network latency by 10%, on average, and by up to 13.9%. AdEle+ also reduces the energy consumption in low-traffic scenarios. Moreover, due to the local traffic monitoring of AdEle+, hardware overhead is reduced compared to CDA selection, in which global traffic information is shared among routers.

Figure 2.1: (a) An example PC-3DNoC with three elevators (e1, e2, and e3). The routing path from S to D based on the Elevator-First algorithm [1] (dotted red line) and the minimal path (solid blue line) are shown. The middle-layer routers are colored based on their Elevator-First selected elevator. (b) Traffic load on each router in the middle layer: the e2 elevator is highly congested because of the inefficient elevator selection in the Elevator-First algorithm (7 out of 16 routers use this elevator router).

The rest of the chapter is organized as follows. We present the background and review some prior related work on PC-3DNoCs in Section 2.2. Section 2.3 discusses the elevator-selection problem and its complexity in PC-3DNoCs. Moreover, it details our proposed technique (AdEle+) and its implementation. In Section 2.4, we explore the parameters of AdEle+ to optimize the elevator selection. In addition, Section 2.4 presents our simulation results, including latency, energy, power, and hardware. Finally, Section 2.5 concludes the chapter.

2.2 Background and Related Work

This section discusses PC-3DNoCs and related challenges in routing, elevator placement, and elevator selection.

2.2.1 Partially Connected 3D Networks-on-Chip

The total number of TSVs highly affects the overall cost of 3D chips [27, 28].
Since each elevator link includes tens to hundreds of TSVs (e.g., 2 × 128 bits in [29]), limiting elevator links to only some routers—a Partially Connected 3D NoC (PC-3DNoC)—can significantly reduce the fabrication cost [1, 6, 18, 30]. For example, a PC-3DNoC with elevators at 25% of the routers significantly reduces the chip fabrication cost with only 18.7% overhead in the network latency, compared to a fully connected 3D NoC [31]. Moreover, PC-3DNoC schemes can be better extended to deal with TSV faults as they are designed to work in a partially connected network with missing TSV links [19]. Note that our work assumes a given PC-3DNoC topology and thus can handle manufacturing TSV faults; considering runtime TSV faults is beyond the scope of this chapter. As shown in Fig. 2.1a, PC-3DNoCs consist of two types of routers: elevator-less routers with five ports (east, south, west, north, and local), and elevator routers that include two additional ports (up and down) for a total of seven ports. As elevator-less routers are not directly connected to the other layers, their inter-layer packets must first route to an elevator router.

2.2.2 Elevator Placement and Routing Algorithms

Similar to elevator selection during routing in PC-3DNoCs (see Section 2.3), elevator placement also plays an important role in network performance. In [31], the authors proposed an elevator placement that minimizes the hop count from all routers to nearby elevators. It is assumed that elevator-less routers utilize their closest elevator in the layer. Under this assumption, an optimized placement is found that minimizes the average router-to-elevator distance. However, during runtime network operation, different elevators might be selected for elevator-less routers, which is not consistent with the assumption in the placement optimization.
To this end, [20] uses a Genetic Algorithm (GA) to place elevators while considering elevator selection to minimize the average distance and load variance. However, these approaches will likely end up with a non-uniform elevator placement after manufacturing faults, which can lead to non-uniform elevator usage. Even with uniform elevator placement, traffic is typically non-uniform, leading to imbalanced elevator utilization. Fortunately, an adaptive elevator-selection algorithm can compensate for these effects and balance the traffic over the elevators.

Figure 2.2: An overview of our proposed elevator-selection algorithm: AdEle+.

Many routing algorithms have been proposed for PC-3DNoCs [1, 6, 18, 19, 24, 30, 32–34]. They mainly differ in their approach to guaranteeing deadlock freedom. For instance, in Elevator-First [1], the most popular deadlock-free routing for PC-3DNoCs, upward-bound packets are routed in one virtual channel while downward-bound packets are routed in another virtual channel to break the cyclic dependency and maintain deadlock freedom. However, Elevator-First is a deterministic routing algorithm and offers little adaptivity in path selection. To this end, LEAD [24] offers an adaptive routing algorithm while guaranteeing deadlock freedom by defining some subnetworks. However, all of these routing algorithms employ a simple selection, where the closest elevator to the source router is used for inter-layer routing.
Although simple, this selection results in an unbalanced traffic load over the elevators and, therefore, unnecessary network congestion and performance loss.

2.2.3 Elevator Selection Algorithms

In PC-3DNoCs, the elevator-selection process can highly impact the traffic distribution across the elevators and, consequently, the performance of the entire network. Unfortunately, elevator selection in PC-3DNoCs has barely been studied in related work. In conventional routing algorithms, usually the closest elevator is selected to route inter-layer packets [1, 32, 33]. However, this selection ignores the elevators' load distribution and can lead to network congestion. This can be especially harmful for PC-3DNoCs with non-uniform elevator placements, a small number of elevators, or a non-uniform traffic distribution. Adaptive elevator-selection techniques have been proposed [6, 18, 30], but they mainly focus on designing fault-tolerant approaches to handle elevator failures. These strategies select the closest non-faulty elevator to the source without considering the elevator's load and congestion. To improve the traffic distribution in PC-3DNoCs, [21] proposed an offline optimized elevator-selection algorithm using the Tabu search algorithm to distribute the load over elevator links and reduce network latency. However, this approach does not consider network energy and cannot capture the dynamics of the runtime network traffic due to its offline nature. In [24], a simple online selection strategy was presented where routers select one elevator randomly. This random selection improves the elevator traffic distribution compared to the closest-elevator selection. However, it increases the average hop count and energy consumption as some packets will have to travel much farther to distant elevators. In [20] and [22], an online Congestion-aware Dynamic elevator Assignment (CDA) was proposed.
CDA selects the elevator that minimizes the sum of the delay and the buffer utilization of the routers between the packet's source and the elevator. However, CDA requires online global information on the routers' delay and buffer utilization, the sharing of which imposes high overhead. Considering the aforementioned efforts, an efficient elevator-selection solution is yet to be realized for PC-3DNoCs. In the rest of this chapter, we propose an adaptive congestion- and energy-aware elevator-selection algorithm (AdEle+) that optimizes a subset of elevators for each router and then leverages these subsets at runtime to dynamically avoid congested elevators and improve traffic while relying only on local information. Employing a set of elevators instead of one elevator per router enables the online selection to adapt to new traffic patterns, hence improving network performance. Moreover, we realize a low-cost Distance-Based (DB) elevator selection for low-congestion scenarios to improve energy efficiency. In addition, an algorithm is proposed for switching between the traffic-aware and distance-based selections to enable AdEle+ to account for a dynamic traffic load. We explore the design parameters of AdEle+ and optimize them to significantly improve performance. We also show the effectiveness of the distance-based elevator selection under low traffic load scenarios and the dynamic switching between the distance-based and congestion-aware selections.

2.3 Proposed Elevator-Selection Algorithm: AdEle+

This section discusses the main challenges for elevator selection in PC-3DNoCs and details our proposed adaptive congestion- and energy-aware elevator-selection algorithm, AdEle+. As an overview, Fig.
2.2 shows the building blocks of the proposed algorithm: AdEle+ uses an offline multi-objective simulated-annealing-based algorithm (AMOSA) to find an optimized subset of elevators for each router and an online elevator-selection algorithm to then select the best elevator from the subset in the presence of runtime traffic.

2.3.1 Motivation: Routing in PC-3DNoCs

In PC-3DNoCs, the routing process requires three main steps because of the irregular topology: 1) selecting a vertical link (elevator) for each packet in the source router and then routing the packet to that elevator on the source layer; 2) vertically routing the packet to the destination layer; and 3) routing the packet from the elevator to the destination node on the destination layer. In PC-3DNoCs, the elevator selection (the first step) is critical as elevators quickly become the network bottleneck due to the small number of elevators. As we will show, AdEle+ optimizes the elevator selection in the first step to balance the traffic over the elevators, thereby reducing network traffic hotspots. Fig. 2.1a shows an example of a PC-3DNoC with three elevators (e1–e3). Router tiles are colored with the color of the elevator they would use under the Elevator-First policy (i.e., the closest elevator to the source router is selected) [1, 6, 18]: four routers will use the green (e1) elevator, seven will use the blue (e2) elevator, and five will use the red (e3) elevator. Unfortunately, such an unbalanced elevator selection can put severe traffic pressure on certain elevators (e2 in this example). Ideally, some of the load on e2 should be assigned to e1 or e3, making e2 less congested. Fig. 2.1b demonstrates the utilization of the middle-layer routers with the Elevator-First selection policy under uniform traffic. This confirms that e2 is highly congested due to the unbalanced elevator selection.
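The imbalance produced by a closest-elevator policy can be reproduced in a few lines. The elevator coordinates below are illustrative assumptions, not the exact layout of Fig. 2.1; the point is that nearest-elevator assignment on a 4×4 layer easily overloads one elevator.

```python
# Sketch: count how many routers in a 4x4 layer pick each elevator under a
# closest-elevator policy. Elevator positions are illustrative assumptions.
from collections import Counter

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

elevators = {"e1": (0, 0), "e2": (1, 2), "e3": (3, 3)}
routers = [(x, y) for x in range(4) for y in range(4)]

# Each router picks its nearest elevator (ties broken by dict order)
assignment = {
    r: min(elevators, key=lambda e: manhattan(r, elevators[e])) for r in routers
}
load = Counter(assignment.values())
print(load)  # with these positions, e2 ends up the most loaded elevator
```

With these hypothetical positions, 7 of the 16 routers happen to select e2, mirroring the imbalance the chapter describes for Elevator-First: under uniform traffic, e2 would carry the largest share of inter-layer load even though the placement looks reasonable.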
In terms of energy efficiency in low traffic loads, the best elevator selection is on the minimal path between the source and destination. However, for the path between S and D in Fig. 2.1a, policies that ignore the location of