DISSERTATION

DESIGN AND OPTIMIZATION OF EFFICIENT, FAULT-TOLERANT AND SECURE 2.5D CHIPLET SYSTEMS

Submitted by
Ebad Taheri
Department of Electrical and Computer Engineering

In partial fulfillment of the requirements
For the Degree of Doctor of Philosophy
Colorado State University
Fort Collins, Colorado
Fall 2024

Doctoral Committee:

Advisor: Mahdi Nikdast
Co-Advisor: Sudeep Pasricha

Yashwant K. Malaiya
Anura P. Jayasumana

Copyright by Ebad Taheri 2024
All Rights Reserved

ABSTRACT

DESIGN AND OPTIMIZATION OF EFFICIENT, FAULT-TOLERANT AND SECURE 2.5D CHIPLET SYSTEMS

In response to the burgeoning demand for high-performance computing systems, this Ph.D. dissertation investigates the pivotal challenges surrounding Networks-on-Chip (NoCs) within the framework of 2.5D and 3D integration technologies, with the primary objective of enhancing the efficiency, fault tolerance, and security of forthcoming computing system architectures. The inherent limitations in bandwidth and reliability at the boundaries of chiplets in 2.5D chiplet systems create significant challenges in traffic management, latency, and energy efficiency. Furthermore, the global network on an interposer, which links multiple chiplets, necessitates high-bandwidth, low-latency communication to accommodate the substantial traffic generated by numerous cores across diverse chiplets. This dissertation addresses various design aspects of NoCs, including latency, energy efficiency, fault tolerance, and security. It explores the design of 3D NoCs leveraging Through-Silicon Vias (TSVs) for vertical communication. To address the reliability concerns and fabrication costs associated with high TSV density, a Partially Connected 3D NoC (PC-3DNoC) is proposed, together with an adaptive congestion-aware TSV link selection algorithm that manages traffic load and optimizes communication, resulting in reduced latency and improved energy efficiency.
For 2.5D chiplet systems, a novel deadlock-free and fault-tolerant routing algorithm is presented. The fault-tolerant algorithm enhances redundancy in vertical-link selection and offers improved network reachability with reduced latency compared to existing solutions, even in the presence of faults. Furthermore, to address the energy consumption of silicon-photonic-based 2.5D networks, a reconfigurable, power-efficient, and congestion-aware silicon-photonic 2.5D interposer network is proposed. The proposed photonic interposer utilizes phase-change materials (PCMs) for dynamic reconfiguration and power gating of the photonic network, leading to lower latency and improved energy efficiency. Additionally, the research investigates the integration of optical computation and communication into 2.5D chiplet platforms for domain-specific machine learning (ML) processing. This approach aims to overcome the limitations in computation density and communication speed faced by traditional accelerators, paving the way for sustainable and scalable ML hardware. Furthermore, this dissertation proposes a 2.5D chiplet-based architecture utilizing a silicon-photonic interposer, which tackles the limitations of conventional bus-based communication by employing a novel switch-based network, achieving significant energy-efficiency improvements for high-bandwidth, low-latency data movement in machine learning accelerators. The switch-based network employs our proposed optical switch, based on Mach–Zehnder Interferometer (MZI) devices with a dividing state, to facilitate broadcast and optimize communication for ML workloads. Finally, the dissertation explores security considerations in 2.5D chiplet systems with diverse, potentially untrusted chiplets. To address this, a secure routing framework for the Network-on-Interposer (NoI) is presented.
The proposed secure framework protects the system against distributed denial-of-service (DDoS) attacks by concealing predictable routing paths, and it leverages multi-objective optimization to balance efficiency and reliability for the NoI. The contributions in this dissertation advance the field of chip-scale interconnection networks by proposing novel techniques for improved performance, reliability, and power efficiency in 3D and 2.5D NoC architectures. These advancements hold promise for the design of future high-performance computing systems, particularly in machine learning and other computationally intensive applications.

ACKNOWLEDGEMENTS

I would like to express my deepest gratitude to my advisor, Prof. Mahdi Nikdast, for his invaluable guidance, unwavering support, and endless patience throughout the journey of this dissertation. His expertise, encouragement, and constructive feedback have been instrumental in shaping this work. Whenever deadlines loomed, he devoted his time to ensure our progress. Beyond the technical realm, Prof. Nikdast extended his support to me during personal hardships, such as the challenging period of pandemic-induced separation from my wife. His unwavering commitment began with my first email to him from Iran, my home country, and continues to this day, five years later. His dedication to our growth and success is truly commendable.

I am also grateful to my co-advisor, Prof. Sudeep Pasricha, for his insightful input, mentorship, and dedication to helping me navigate the complexities of my research. His perspective and encouragement have been immensely valuable in refining my ideas and methodologies. Prof. Pasricha generously devoted his time to assist me in every step of this project, and I consider myself fortunate to have such a knowledgeable and supportive mentor.

I extend my heartfelt thanks to the members of my committee, Prof. Anura P. Jayasumana and Prof. Yashwant K.
Malaiya, for their time, expertise, and constructive criticism. Their diverse perspectives and thoughtful insights have greatly enriched the quality of this dissertation.

Special gratitude goes to my mentors and collaborators outside Colorado State University. Prof. Ahmad Patooghy, my master's degree advisor, shaped my mindset and encouraged me to pursue a Ph.D. I also want to express my appreciation to Prof. Ryan Kim, who provided significant assistance and collaboration during the first two years of my Ph.D. project. Additionally, I am thankful to Prof. Nader Sehatbakhsh and his student, Pooya Aghanoury, at UCLA for their collaboration, as well as Prof. Kaveh Rahbardar Mojaver for their assistance in my Ph.D. project.

To my collaborators and labmates, including Amin Mahdian, Kamil Khan, Febin Sunny, Alex Qi, and Asif Mirza, I express my gratitude for your camaraderie, support, and intellectual exchange. I have learned a great deal from each of you and am thankful for the ideas we have built together. Additionally, I am grateful to my colleagues at the EPIC lab and ECSyD lab. Your friendship and collaboration have enriched this journey, making it both enjoyable and rewarding.

I owe a debt of gratitude to my wife, Mydeh, for her unwavering love, understanding, and patience. Her encouragement, support, and sacrifices have been the bedrock of my success, and I am profoundly grateful for her presence in my life.

Last but not least, I want to thank my parents for their unconditional love, support, and belief in me. Their encouragement and sacrifices have been a source of strength and motivation throughout this endeavor.

This dissertation would not have been possible without the support and encouragement of all those mentioned above, as well as many others who have contributed in ways both seen and unseen. Thank you all for being part of this transformative journey.

LIST OF RESEARCH PUBLICATIONS

1. E. Taheri, M. A. Mahdian, S. Pasricha, and M.
Nikdast, "SwInt: A Non-Blocking Switch-Based Silicon Photonic Interposer Network for 2.5D Machine Learning Accelerators," IEEE Journal on Emerging and Selected Topics in Circuits and Systems.
2. E. Taheri, S. Pasricha, and M. Nikdast, "ReD: A Reliable and Deadlock-Free Routing Algorithm for 2.5D Chiplet Networks," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
3. E. Taheri, P. Aghanoury, S. Pasricha, M. Nikdast, and N. Sehatbakhsh, "SCRIPT: A Multi-Objective Routing Framework for Securing Chiplet Systems against Distributed DoS Attacks," Proceedings of the Great Lakes Symposium on VLSI.
4. M. A. Mahdian, E. Taheri, K. Rahbardar Mojaver, and M. Nikdast, "Photonic Physically Unclonable Functions using Ring-Assisted Contra-Directional Couplers," IEEE/Optica Optical Fiber Communication (OFC) Conference, 2024.
5. F. Sunny, E. Taheri, M. Nikdast, and S. Pasricha, "Silicon Photonic 2.5D Interposer Networks for Overcoming Communication Bottlenecks in Scale-out Machine Learning Hardware Accelerators," IEEE VLSI Test Symposium, 2024.
6. E. Taheri, M. A. Mahdian, S. Pasricha, and M. Nikdast, "TRINE: A Tree-Based Silicon Photonic Interposer Network for Energy-Efficient 2.5D Machine Learning Acceleration," Proceedings of the 16th International Workshop on Network on Chip, 2023.
7. M. A. Mahdian, E. Taheri, and M. Nikdast, "Pars: A Power-Aware and Reliable Control Plane for Silicon Photonic Switch Fabrics," International Conference on Photonics in Switching and Computing (PSC), 2023.
8. E. Taheri, R. G. Kim, and M. Nikdast, "AdEle+: An Adaptive Congestion-and-Energy-Aware Elevator Selection for Partially Connected 3D Networks-on-Chip," IEEE Transactions on Computers, 2023.
9. F. Sunny, E. Taheri, M. Nikdast, and S. Pasricha, "Machine Learning Accelerators in 2.5D Chiplet Platforms with Silicon Photonics," IEEE/ACM Design, Automation and Test in Europe (DATE) Conference and Exhibition, 2023.
10. E. Taheri, S. Pasricha, and M.
Nikdast, "ReSiPI: A Reconfigurable Silicon-Photonic 2.5D Chiplet Network with PCMs for Energy-Efficient Interposer Communication," IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2022.
11. E. Taheri, S. Pasricha, and M. Nikdast, "DeFT: A Deadlock-Free and Fault-Tolerant Routing Algorithm for 2.5D Chiplet Systems," IEEE/ACM Design, Automation and Test in Europe (DATE) Conference and Exhibition, 2022.
12. F. Sunny, E. Taheri, M. Nikdast, and S. Pasricha, "A Survey on Silicon Photonics for Deep Learning," ACM Journal on Emerging Technologies in Computing Systems (JETC), 2021.
13. E. Taheri, R. G. Kim, and M. Nikdast, "AdEle: An Adaptive Congestion-and-Energy-Aware Elevator Selection for Partially Connected 3D NoCs," IEEE/ACM Design Automation Conference (DAC), 2021.
14. A. Mirza, S. Manafi Avari, E. Taheri, S. Pasricha, and M. Nikdast, "Opportunities for Cross-Layer Design in High-Performance Computing Systems with Integrated Silicon Photonic Networks," IEEE/ACM Design, Automation and Test in Europe (DATE) Conference and Exhibition, 2020.

TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGEMENTS
LIST OF RESEARCH PUBLICATIONS
LIST OF TABLES
LIST OF FIGURES

Chapter 1  Introduction
  1.1  Multicore and system-on-chip
  1.2  On-chip communication
  1.3  Networks-on-Chip
    1.3.1  Topology
    1.3.2  Router architecture
    1.3.3  Switching
    1.3.4  Routing
    1.3.5  Deadlock recovery
    1.3.6  Deadlock avoidance
    1.3.7  3D Networks-on-Chip
    1.3.8  2.5D Networks-on-Chip (chiplet systems)
  1.4  Silicon Photonic Networks-on-Chip
  1.5  NoC design challenges in 2.5D and 3D chip technologies
    1.5.1  Network latency and traffic distribution
    1.5.2  Fault tolerance
    1.5.3  Hardware security
    1.5.4  Energy efficiency
    1.5.5  Scalability
  1.6  Dissertation motivation
  1.7  Dissertation outline

Chapter 2  Adaptive Congestion-and-Energy-Aware Elevator Selection for Partially Connected 3D NoCs
  2.1  Introduction
  2.2  Background and Related Work
    2.2.1  Partially Connected 3D Networks-on-Chip
    2.2.2  Elevator Placement and Routing Algorithms
    2.2.3  Elevator Selection Algorithms
  2.3  Proposed Elevator-Selection Algorithm: AdEle+
    2.3.1  Motivation: Routing in PC-3DNoCs
    2.3.2  Offline Optimization in AdEle+
    2.3.3  Adaptive Online Elevator Selection
    2.3.4  Elevator Selection in AdEle+ under Low Traffic
    2.3.5  Adaptive Selection Between Distance-Based or Enhanced Round-Robin
  2.4  Experimental Results and Evaluations
    2.4.1  Simulation Setup
    2.4.2  Parameter Exploration and Optimization
    2.4.3  Evaluation Results
  2.5  Conclusion

Chapter 3  A Reliable and Deadlock-Free Routing Technique for 2.5D Chiplet-based Interposer Networks
  3.1  Introduction
  3.2  Background and Related Work
    3.2.1  Active interposers and their challenges
    3.2.2  Deadlock-Free Routing in Active Interposers
    3.2.3  Fault-Tolerant Routing in Conventional 2D Networks
    3.2.4  Fault-Tolerant Routing in 2.5D Chiplet Systems
  3.3  ReD: Proposed Routing Algorithm
    3.3.1  Proposed VN Separation and Deadlock-Freedom
    3.3.2  Proposed Fault-Tolerant Congestion-Aware VL Selection
    3.3.3  Fault-Tolerant Routing to Handle Horizontal Link Faults
    3.3.4  Sharing Fault Updates
    3.3.5  Deadlock- and Livelock-Freedom
  3.4  Evaluation and Simulation Results
    3.4.1  Simulation Environment and Configuration
    3.4.2  Latency Analysis
    3.4.3  Fault-Tolerance Analysis
    3.4.4  Hardware and Power Analysis
    3.4.5  Energy Analysis
  3.5  Conclusion

Chapter 4  A Reconfigurable Silicon-Photonic 2.5D Chiplet Network with PCMs for Energy-Efficient Interposer Communication
  4.1  Introduction
  4.2  Background and Related Work
    4.2.1  Chiplet systems and electronic interposers
    4.2.2  Silicon photonic interposers
    4.2.3  PCM-based silicon photonic devices
  4.3  ReSiPI: Overview
    4.3.1  Dynamic gateway management
    4.3.2  ReSiPI interposer network architecture
    4.3.3  Adaptive active gateway selection
    4.3.4  Per-packet gateway selection
    4.3.5  Reconfiguration controller architecture
  4.4  Simulation Results and Analysis
    4.4.1  Simulation setup
    4.4.2  Design-space exploration: Optimal Lm
    4.4.3  ReSiPI controller overhead
    4.4.4  Latency, power, and energy analysis
    4.4.5  Adaptivity analysis
    4.4.6  Bandwidth distribution analysis
  4.5  Conclusion

Chapter 5  Machine Learning Accelerators in 2.5D Chiplet Platforms with ReSiPI-based Silicon Photonic Interposer
  5.1  Introduction
  5.2  Case study: Silicon photonic 2.5D DNN accelerator
  5.3  Experimental results
  5.4  Conclusion and open challenges

Chapter 6  A Non-Blocking Silicon Photonic Switch-Based Interposer Network for 2.5D Machine Learning Accelerators
  6.1  Introduction
  6.2  Background and Related Work
    6.2.1  Silicon Photonic Communication
    6.2.2  Silicon Photonic Interposers
  6.3  Proposed Interposer Network: SwInt
    6.3.1  Architecture of the 2.5D Accelerator in SwInt
    6.3.2  Motivation for Switch-based Network Architectures
    6.3.3  SwInt Switch Topology
    6.3.4  Routing Strategies and Optimization in SwInt
    6.3.5  Multicast and Broadcast Capabilities in SwInt
  6.4  Design and evaluation of SwInt's devices
    6.4.1  MZS Design and Optimization
    6.4.2  Micro Ring Resonator Design and Optimization
  6.5  SwInt's Architecture Evaluation
    6.5.1  Simulation Setup
    6.5.2  Simulation Results
  6.6  SwInt with Bus Integration
    6.6.1  Switch-based and bus-based architectures
    6.6.2  SwInt with Bus Integration: Design Trade-offs
  6.7  Conclusion

Chapter 7  SCRIPT: A Multi-Objective Routing Framework for Securing Chiplet Systems against Distributed DoS Attacks
  7.1  Introduction
  7.2  Background and Related Work
  7.3  Threat Model and Assumptions
  7.4  SCRIPT: Proposed Performance- and Security-aware Routing Framework
    7.4.1  Overview
    7.4.2  SCRIPT Routing Process
    7.4.3  Enhancing Security in VLs
    7.4.4  Multi-Objective Optimization
    7.4.5  Obfuscated Deadlock-Free Routing
  7.5  Evaluation and simulation results
    7.5.1  Simulation Setup
    7.5.2  Multi-Objective Optimization Results
    7.5.3  Security and Latency Analysis
    7.5.4  Area and Power Analysis
  7.6  Conclusion

Chapter 8  Summary and future work
  8.1  Summary
  8.2  Future Work
    8.2.1  Integration of AdEle+ and ReD
    8.2.2  Optimize ReSiPI Architectures to Address Non-Ideal Frequency Response of PCMs
    8.2.3  On-chip Network Monitoring and Machine Learning
    8.2.4  Co-design of Communication Architectures and Machine Learning Workloads
    8.2.5  Security Enhancements in SCRIPT
    8.2.6  Scaling SwInt for Large-scale Systems
    8.2.7  Efficient electronic controllers

Bibliography

LIST OF TABLES

2.1  Simulation setup of AdEle+
2.2  Performance of selected solutions from Fig. 2.6
2.3  trDB for different elevator placements
2.4  Area overheads for the different elevator selection algorithms
3.1  Abbreviations used in ReD's chapter
3.2  Simulation setup of ReD
3.3  Area and power analysis of ReD, MTR, and RC
4.1  Simulation setup of ReSiPI
4.2  Overhead analysis of ReSiPI's controller (see Fig. 4.7)
6.1  Simulation setup of SwInt
7.1  Area and power: SCRIPT vs. DeFT and MTR

LIST OF FIGURES

1.1  On-chip communication paradigms: (a) in point-to-point communication, each core is directly connected to every other core through a dedicated physical link; (b) in bus-based communication, the bus is shared among the cores; and (c) a Network-on-Chip can handle simultaneous communication between cores.
1.2  (a) A Network-on-Chip with four cores, and (b) the corresponding topology of this network.
1.3  Network-on-Chip topology examples: (a) mesh, (b) torus, and (c) butterfly.
1.4  NoC router architecture.
1.5  Routing example: (a) XY routing, and (b) the turn model of XY routing.
1.6  Deadlock-free turn models: (a) west-first, (b) north-last, and (c) negative-first.
1.7  Deadlock turn-model examples: (a) the turns in the right-hand cycle create a cyclic dependency, (b) the combination of the turns in both cycles can create a cyclic dependency, and (c) an example of a combination of the avoided turns in (b) resulting in a deadlock cycle.
1.8  Subnetworks to achieve deadlock freedom while offering fully adaptive routing: (a) adaptive routing from south to north (e.g., S1 to D1), and (b) adaptive routing from north to south (e.g., S2 to D2).
12 1.9 An example of a 2.5D chiplet system with four chiplets connected through an inter- poser. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 1.10 Data transmission on a silicon photonic interposer. . . . . . . . . . . . . . . . . . . . 15 2.1 (a) An example PC-3DNoC with three elevators (e1, e2, and e3). The routing path from S to D based on Elevator-First algorithm [1] (dotted-red line) and the minimal path (blue-solid line) are shown. The middle-layer routers are colored based on their Elevator-First selected elevator. (b) Traffic load on each router in the middle layer: the e2 elevator is highly congested because of the inefficient elevator selection in Elevator- First algorithm (7 out of 16 routers use this elevator router). . . . . . . . . . . . . . . . 25 2.2 An overview of our proposed elevator-selection algorithm: AdEle+. . . . . . . . . . . 27 2.3 (a) Closest elevator selection example shows that 11 out of 36 destination routers in the bottom layer use non-minimal paths from source S (red destinations receive packets non-minimally). (b) Distance-Based (DB) elevator selection in AdEle+ shows exam- ple quadrants (RNW , RNE , RSW , RSE) based on S. Each elevator is color-coded based on their quadrant (with some elevators with two colors), and the regional clos- est elevator (RCE) for each region is marked with a color-coded star. Similarly, the closest elevator (CE) is marked with a black star (e5). Our DB elevator selection will consider the paths through the RCE in the destination quadrant and the overall closest elevator. Our DB elevator selection only routes 2 out of 36 destinations non-minimally from source S. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 xiii 2.4 Comparison of different elevator selection policies in terms of average distance under different network sizes and number of elevators. 
For each network size and number of elevators, 100 random elevator-placement patterns are evaluated. Here, we report both the average (average case) and the maximum (worst case) of all the 100 random elevator patterns’ average distance. (a) 4× 4 layer size - average case; (b) 4× 4 layer size - worst case; (c) 8× 8 layer size - average case; and (d) 8× 8 layer size - worst case. 36 2.5 Proposed framework to find trDB. ERR and distance-based elevator selections are simulated under different injection loads. The average cost (2.8) of the load where latency of both ERR and distance-based selections are equal is used as the minimal threshold (trDB) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 2.6 Elevator-subset solutions found by AMOSA in AdEle+. . . . . . . . . . . . . . . . . . 41 2.7 Impact of elevator set size on (a and c) uniform traffic and (b and d) shuffle traffic. . . . 42 2.8 Comparison of minimal versus ERR elevator selection under Uniform traffic: (a) av- erage latency and (b) energy per flit. This is used to find trDB. . . . . . . . . . . . . . 43 2.9 Comparison of ERR selection under Uniform traffic with different values of trDB for PL in (a) average latency and (b) energy per flit. . . . . . . . . . . . . . . . . . . . . . 44 2.10 Average latency for Elevator-First, CDA, AdEle, AdEle_RR, and AdEle+ under uni- form traffic and using different elevator placements. . . . . . . . . . . . . . . . . . . . 45 2.11 Average latency for Elevator-First, CDA, AdEle, AdEle_RR, and AdEle+ under shuf- fle traffic and using different elevator placements. . . . . . . . . . . . . . . . . . . . . 
46 2.12 Comparison of elevator traffic distribution for Elevator-First (ElevFirst), CDA, and AdEle+ in terms of (a) traffic load over elevators (blue, green, and red) normalized to the average load over horizontal links (white bars) of Elevator-First; (b) average flit residency over elevators; and (c) average flit residency of all elevators normalized to average flit residency of horizontal links. AdEle+_DB shows DB selection of AdEle+. 46 2.13 Normalized energy per flit for Elevator-First, CDA, AdEle, AdEle_RR, and AdEle+ under uniform traffic with different injection rates. . . . . . . . . . . . . . . . . . . . . 47 2.14 Latency for Elevator-First (ElevFirst), CDA, AdEle and AdEle+ normalized to Elev- First under real-application traffic with different elevator placements. . . . . . . . . . . 49 2.15 Energy for Elevator-First (ElevFirst), CDA, AdEle, and AdEle+ normalized to Elev- First under real-application traffic with different elevator placements. . . . . . . . . . . 50 2.16 Packet size effect on latency and energy-per-flit. . . . . . . . . . . . . . . . . . . . . . 51 3.1 An abstract overview of the baseline 2.5D network with four chiplets on an active interposer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.2 ReD’s rules for VN utilization and deadlock-freedom. . . . . . . . . . . . . . . . . . . 62 3.3 Examples for a VL selection in a chiplet with 16 routers: (a) A fault-free distance- based selection (closest VL is selected) under uniform traffic, (b) A good selection under non-uniform traffic, and (c) A distance-based selection with a VL fault and under uniform traffic, suffering from imbalanced traffic load on VLs. Here, T and l denote the inter-chiplet traffic rate of routers and VLs, respectively. Also, a router’s color represents its selected VL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
63 3.4 Deadlock-free turn models used for intra-chiplet and -interposer routing in (a) the first VN and (b) the second VN. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 xiv 3.5 Minimal routing to bypass HL faults (A2E: adaptive-to-east VN, A2W: adaptive-to- west VN). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 3.6 Non-minimal routing to get around HL faults (A2E: adaptive-to-east VN, A2W: adaptive- to-west VN). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 3.7 Proposed Hamiltonian-based approach for sharing faulty/healthy updates of VLs: (a) Hamiltonian-based routing is used to distribute information through increasing and de- creasing IDs, routed in the first and second VNs respectively, (b) and (c) two Hamiltonian- based routings (green and black) are used for fault tolerance when sharing information, specifically designed to handle HL faults. . . . . . . . . . . . . . . . . . . . . . . . . 76 3.8 Spare VLs used for packets which failed to receive VL updates and are routed to a faulty VL: (a) Spare VL on chiplets and (b) Spare VL on the interposer. . . . . . . . . 76 3.9 Average latency comparison among ReD, MTR, and RC routing algorithms when ap- plied to 2.5D networks with four-chiplet system. The comparison is conducted under different synthetic traffic patterns: (a) Uniform, (b) Localized, and (c) Hotspot. . . . . 81 3.10 Average latency comparison under different system sizes and Uniform traffic. (a) 6 chiplets, (b) 8 chiplets, and (c) 12 chiplets. . . . . . . . . . . . . . . . . . . . . . . . . 81 3.11 VC utilization of ReD under Uniform, Localized, and Hotspot traffic patterns. . . . . . 82 3.12 Latency improvement under real-application traffic using (a) a single application, and (b) two applications simultaneously. . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 3.13 Reachability in the presence of VL faults in a system with four chiplets. 
Note that ReD-Wrst. and ReD-Avg. are the same (both shown by ReD). 85
3.14 Reachability in the presence of VL faults in a system with (a) four chiplets (total VLs=32), and (b) twelve chiplets (total VLs=96). Note that ReD-Wrst. and ReD-Avg. are the same (both shown by ReD). 87
3.15 Average latency in ReD with different VL-selection strategies, fault-tolerant approaches, and fault-injection rates for a four-chiplet system. (a) VL faults, (b) HL faults on chiplets, (c) HL faults on the interposer, and (d) uniformly combined faults on VLs, HL faults on chiplets, and HL faults on the interposer. 87
3.16 Average percentage of updated routers when sharing the status of faulty/healthy VLs for (a) chiplets and (b) the interposer. 88
3.17 Normalized energy comparison for a four-chiplet network with (a) no faults, (b) 12.5% faulty VLs, and (c) 25% faulty VLs. 89
4.1 Dynamic inter-chiplet bandwidth management: Design A with a larger number of wavelengths (e.g., four) and Design B, which is considered in ReSiPI, with a larger number of gateways and fewer (e.g., two) wavelengths. 96
4.2 An example of the proposed photonic interposer architecture (ReSiPI) with a total of six gateways (one per chiplet) and four optical wavelengths. This architecture can be extended to have multiple gateways per chiplet. 97
4.3 A PCM-based coupler (PCMC) in different states: (a) crystalline state to guide light to the Bar (B) output, (b) partially crystalline state to guide a portion of light to the Cross (C) output and the rest to the Bar output, and (c) amorphous state to guide the input light to the Cross output.
99
4.4 Number of active gateways based on load changes. 103
4.5 Dynamic gateway management in ReSiPI. 104
4.6 An example of the adaptive gateway selection in ReSiPI for different numbers of activated gateways. The dashed boxes show the routers that will use a specific gateway (G). 106
4.7 ReSiPI's reconfiguration controller architecture. 107
4.8 Average latency vs. optical link transmission rate. 109
4.9 (a) Normalized average latency, (b) Normalized average power, and (c) Normalized energy. 109
4.10 Adaptivity comparison between ReSiPI and PROWAVES: (a) average delay, (b) average power, (c) number of activated gateways in ReSiPI, and (d) number of activated wavelengths in PROWAVES. 110
4.11 Average residency of flits on the routers in the first chiplet in (a) PROWAVES and (b) ReSiPI. 113
5.1 Overview of proposed 2.5D interposer chiplet-based DNN accelerator architecture. 117
5.2 Example of optical communication on interposer: MACs are reading data from memory. 118
5.3 Silicon photonic network in our 2.5D chiplet-based DNN accelerator. Each MRG is connected to a gateway on a chiplet. 119
5.4 Performance analysis of CrossLight, 2.5D-CrossLight with electronic interposer, and 2.5D-CrossLight with silicon photonic interposer, (a) normalized power consumption, (b) normalized total latency, and (c) normalized energy-per-bit. 120
6.1 Chiplet system architecture considered in SwInt.
126
6.2 A bidirectional silicon photonic link with four wavelengths to provide optical communication between the GLB chiplet and a MAC chiplet. 127
6.3 (a) GLB to MAC chiplets communication via a bus-based approach, (b) MAC chiplets to GLB communication via a bus-based approach, (c) SwInt facilitating GLB to MAC chiplets communication using a switch-based interposer, and (d) SwInt enabling MAC chiplets to GLB communication. 127
6.4 (a) MRR's Through-port response and optical loss under different wavelengths. (b-c) Power loss of filters in the bus-based network compared to power loss of the Benes and Butterfly networks. 129
6.5 An example of (a-b) Butterfly topology with suboptimal input selection, (c) SwInt network for GLB to MAC chiplets communication, (d) SwInt network for MAC to GLB chiplets communication, and (e) Benes topology. Orange switches are removed compared to a typical Butterfly/Benes topology. Red switches suffer from blocking. 134
6.6 An example of a broadcast in SwInt, achieved using the Divide state in the MZI switch. 138
6.7 (a) An MZS including the MMI couplers and PN junction phase shifter. (b) An adiabatic MRR (wr2 > wr1), and (c) SEM image of our fabricated device. 141
6.8 (a) Field distribution of Cross-state, (b) Bar-state, and (c) Divide-state. 142
6.9 (a) The response of the fabricated MMI. (b) The power imbalance between the two outputs of the MMI. 142
6.10 Power imbalance effect of the "Divide" state on the scalability of broadcasting. 143
6.11 Power analysis: (a) Unicast state, and (b) Broadcast state. White portion in SPACX and SwInt shows power saving.
144
6.12 (a) Latency analysis, (b) energy analysis (the acronym represents models' names listed in Section 6.5.1), and (c) scalability analysis. The results are normalized to the average case (Avg) in SPRINT. 144
6.13 Analysis of blocking instances of Butterfly in comparison with SwInt. (a) 4 chiplets, (b) 8 chiplets, and (c) 12 chiplets. 145
6.14 SwInt with Bus, with 2 readers at each output (B = 2). 148
6.15 Exploring the optimal B (number of readers at each output of SwInt) in the SwInt with Bus design under different numbers of wavelengths. 148
7.1 High-level overview of a 2.5D network with 4 chiplets, subdivided into 16 IPs. Each chiplet contains four boundary routers interfacing with the interposer. Chiplets 1 & 2 are malicious and transmit DDoS flood packets to Chiplet 3. 152
7.2 SCRIPT framework: (a) Design-time optimization takes into account the trust level and criticality of chiplets, along with performance objectives, to propose a list of Vertical Links (VLs) for use in the runtime routing. (b) In routing, a VL is selected, and the routing mode (obfuscated or regular) is determined to deliver the packet to the specified VL. 153
7.3 Normalized average distance overhead with two different patterns of chiplet-connected VLs (P1 and P2). 159
7.4 Multi-objective optimization and solution selection process. (a) Pareto front under weighted distance and utilization variance objectives. Five solutions on the Pareto front are selected for evaluation. (b) Latency results of the simulation under the selected solutions.
166
7.5 (a) Normalized average latency under real application scenarios without attacks. X-axis shows the initial two letters of each application. (b) Maximum average flit residency in a router among all routers of the NoI under different attack scenarios (results are averaged among all applications). 167
7.6 Impact of an increased number of untrusted and attacker chiplets on (a) average latency, and (b) flit residency. 168

Chapter 1

Introduction

The relentless demand for high computational power has been a driving force in the semiconductor industry, leading to the development of innovative chip designs focused on scalability and performance enhancement. One notable solution that emerged from this pursuit is the integration of multiple cores onto a single chip, marking the era of multicore chips and systems-on-chip (SoCs), where numerous processing/memory cores are integrated onto the same chip. However, utilizing numerous processing cores in chips with traditional communication infrastructures poses significant challenges. NoCs have been introduced as a suitable communication platform for inter-core communication within chips. The possibility of fabricating three-dimensional integrated circuits, and consequently creating three-dimensional on-chip networks, has been a significant achievement in chip fabrication, particularly in addressing the challenges of traditional chips, especially in terms of latency. However, three-dimensional integration leads to a substantial increase in power density in chips. The high power density in these networks creates hot spots, posing a threat to the reliability of chips. Furthermore, chiplet systems, as an innovative technology, improve scalability and fabrication cost by disintegrating large-scale chips and connecting the resulting chiplets through a global interposer.
Both 3D and 2.5D integration pave the way for a high density of cores in the same package, targeting high computational power and energy efficiency. This is especially relevant given recent advances in machine learning applications, which demand massive parallelism and high computational power at minimal energy consumption to remain sustainable. However, providing efficient communication for such new technologies raises significant energy-efficiency, fault-tolerance, and security challenges. Therefore, designing and optimizing efficient NoCs for these innovative technologies is of great importance. Silicon photonic technology, in which on-chip communication is enabled by exchanging data using optical signals, is also a strong candidate for global interposer networks, offering high bandwidth and low latency over relatively long distances on the chip.

This chapter introduces multicore chips, SoCs, NoCs, and 3D and 2.5D integration. Moreover, it discusses how silicon photonics can enhance the energy efficiency and performance of 2.5D chiplet systems. It also briefly discusses energy efficiency, fault tolerance, and security in these networks, and outlines the objectives of this dissertation.

1.1 Multicore and system-on-chip

Multicore chips represent a paradigm shift in chip architecture, where several processing cores are integrated on the same chip to leverage parallel processing capabilities. For instance, the IBM Cyclops-64 chip includes 160 processing cores integrated on the same chip [2]. By incorporating multiple cores, multicore chips enable concurrent execution of tasks, thereby significantly boosting computational throughput and efficiency. The concept of multicore chips extends beyond homogeneous architectures to embrace heterogeneity, exemplified by the advent of System-on-Chip (SoC) designs.
In SoC implementations, diverse processing units such as central processing units (CPUs), graphics processing units (GPUs), caches, and specialized accelerators are integrated onto a single chip. This heterogeneous architecture facilitates efficient utilization of resources by allocating specific tasks to the most suitable processing element, thus optimizing performance for various workloads. The integration of heterogeneous cores within SoCs introduces a new era of computational efficiency and versatility. By leveraging the strengths of different processing units, SoCs empower applications to harness a broader spectrum of computational capabilities, ranging from general-purpose computing tasks handled by CPUs to parallelized data processing facilitated by GPUs and accelerators. Furthermore, the scalability inherent in multicore and heterogeneous chip designs aligns with the ever-growing demands of modern computing applications. As workloads become increasingly complex and diverse, the ability to seamlessly scale computational resources through multicore and heterogeneous architectures becomes paramount in meeting performance requirements while managing power consumption and cost constraints. However, assuming a large number of cores on a single chip, an effective communication paradigm must be designed to achieve high performance.

Figure 1.1: On-chip communication paradigms: (a) in point-to-point communication, each core is directly connected to every other core using a physical link; (b) in bus-based communication, the bus is shared among cores; and (c) in a Network-on-Chip, simultaneous communication between cores can be handled.

1.2 On-chip communication

Point-to-point communication, where there exists a dedicated physical connection between two cores, is not scalable.
Assuming N cores on a chip, providing communication between all cores would require N × (N − 1) direct connections (almost N^2 in a large-scale chip). Consequently, point-to-point communication proves impractical for current and future SoCs. The primary concern stems from limitations in the number of metal layers available in a chip's manufacturing technology.

To alleviate such limitations, traditional bus-based communication architectures have been introduced, where one or multiple buses are shared among all cores. However, bus-based communication suffers from scalability issues, both in terms of latency and performance. When a bus is allocated for communication between two cores, other cores sharing the same bus must wait until the communication is complete before proceeding with their own communication tasks. This waiting introduces latency into the communication process, and as the number of cores sharing the same bus increases, so does the latency imposed on the communication packets.

To address these scalability concerns in on-chip communication, Networks-on-Chip (NoCs) have been proposed. NoCs offer a more efficient and scalable communication paradigm compared to traditional point-to-point and bus-based architectures. These communication paradigms are compared in Figure 1.1.

Figure 1.2: (a) A Network-on-Chip with four cores, and (b) the corresponding topology of this network.

Figure 1.3: Network-on-Chip topology examples: (a) mesh, (b) torus, and (c) butterfly.

1.3 Networks-on-Chip

In comparison with conventional communication paradigms such as point-to-point and bus-based communication, NoCs provide efficient communication for on-chip systems because they offer high performance, low power consumption, and better scalability. NoCs provide communication for processing elements by employing routers that are connected using interconnects.
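To make the wiring argument above concrete, the following sketch (illustrative only; it assumes a k × k mesh as the NoC topology) compares the number of physical links each paradigm requires:

```python
def point_to_point_links(n):
    """Dedicated link between every ordered pair of cores: N * (N - 1)."""
    return n * (n - 1)

def mesh_links(k):
    """Bidirectional links in a k x k 2D mesh: k rows and k columns,
    each with (k - 1) links, giving 2 * k * (k - 1) in total."""
    return 2 * k * (k - 1)

for k in (4, 8, 16):
    n = k * k  # number of cores
    print(f"{n} cores: point-to-point needs {point_to_point_links(n)} links, "
          f"mesh needs {mesh_links(k)}")
```

For 64 cores, point-to-point wiring already demands 4032 dedicated connections, while an 8 × 8 mesh needs only 112 links, which is why NoCs scale where direct wiring cannot.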
NoCs have emerged as a prevalent solution to enable scalable on-chip communication in manycore systems [3]. Figure 1.2 illustrates a NoC architecture featuring four cores, along with the corresponding topology depicting the interconnection scheme between these cores.

1.3.1 Topology

NoC architectures provide versatile communication flexibility through a variety of topologies, which dictate how routers are interconnected. These topologies, exemplified by the mesh, torus, and butterfly configurations in Figure 1.3, significantly influence network performance based on specific application requirements. The mesh topology establishes a grid-like interconnection, enabling horizontal and vertical communication among cores. Mesh is favored for its simplicity, as its planar topology reduces fabrication costs and ensures uniform link lengths, facilitating low worst-case link delay for high-frequency operation. However, it suffers from a large diameter and average distance between nodes, particularly at the network's edges where connectivity is lower. To address this, the torus topology forms a toroidal structure, linking the outermost cores, thus enhancing communication efficiency and reducing latency. Nonetheless, torus configurations may incur longer link delays and increased complexity. Meanwhile, the butterfly topology organizes cores hierarchically, enabling parallel communication paths resembling a butterfly's wings. These diverse topologies present trade-offs in scalability, latency, and power consumption, allowing designers to customize NoC architectures to meet specific SoC design needs. Additionally, NoC topologies, categorized into standard and application-specific configurations, determine router and core interconnections. Standard topologies ensure connectivity between all routers, while optimized configurations match traffic patterns for improved performance and reduced costs.
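The diameter advantage of the torus over the mesh noted above can be quantified with a quick sketch (hop counts under minimal routing in a k × k network; a standard textbook property, not a result of this dissertation):

```python
def mesh_diameter(k):
    """Worst-case hop count in a k x k mesh: opposite corners are
    (k - 1) hops apart in each of the two dimensions."""
    return 2 * (k - 1)

def torus_diameter(k):
    """Worst-case hop count in a k x k torus: wrap-around links cap the
    distance per dimension at floor(k / 2)."""
    return 2 * (k // 2)

for k in (4, 8, 16):
    print(f"{k}x{k}: mesh diameter = {mesh_diameter(k)}, "
          f"torus diameter = {torus_diameter(k)}")
```

For an 8 × 8 network the torus nearly halves the worst-case distance (8 hops vs. 14), at the cost of the longer wrap-around links mentioned above.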
Mesh and torus configurations are prevalent in standard setups, offering simplicity and efficiency, while specialized configurations cater to specific application requirements, enhancing performance at the expense of planar properties or complexity.

1.3.2 Router architecture

A router in an NoC, as shown in Figure 1.4, comprises the following main components:

• Buffers: These buffers are of the first-in-first-out (FIFO) type and are used to store packets passing through the router. Routers can have input buffers, output buffers, or both. For example, an input-buffered router stores incoming data first and then, after routing, switch allocation, and link assignment, sends the data to the output link. In the model depicted in Figure 1.4, both input and output channels have buffers.

Figure 1.4: NoC router architecture.

• Switch: This unit is responsible for connecting the input buffers or links of the router to the corresponding output buffers or links.

• Routing and Arbitration Unit: This unit implements the routing algorithm, selects the output link for an input packet, and sets the crossbar accordingly. If multiple packets simultaneously request the same output link, this unit arbitrates between them. If the link is busy, the input packet waits in the input buffer and, once the link becomes available again, is routed.

• Link Controller: The flow of packets in the physical channel between neighboring routers is managed by the link controller unit. A link controller is placed on both sides of a channel to transmit flow control units.

1.3.3 Switching

Data communication in an NoC involves dividing the data intended for transmission between cores into packets, which are further divided into smaller units called flits.
Each flit is further divided into even smaller units called phits, which determine the width of the physical channel and are typically considered equal to a flit. Switching can be categorized based on temporal behavior and packet storage methods [4]:

Circuit-switched Routing

This method involves reserving a path (circuit) from the source to the destination before data transmission. All packets of a message utilize the same path. A route-request is sent from the source to the destination to establish the circuit, allocating channels for packet transmission. The circuit channels remain dedicated to the source-destination pair until data transmission ends, after which the resources are released. However, the circuit-construction overhead and the difficulty of finding free circuits during network congestion are drawbacks.

Packet-based Routing

This method does not require route establishment before data transmission. Routing is performed separately for each packet, potentially leading to different routes for different packets of the same message. Packet-based routing includes three types:

• Store-and-Forward: Upon reaching a node, the entire packet is stored in the buffer, and routing decisions are made based on the decoded routing information. The packet is forwarded only after it has been completely received and the next router can accommodate the entire packet. This method ensures accurate routing but may lead to higher latency.

• Direct Virtual Cut-Through: Routing information is placed in the header of the first flit. Only the first flit needs to be received before a routing decision is made, enabling faster transmission compared to store-and-forward routing.

Figure 1.5: Routing example: (a) XY routing, and (b) the turn model of XY routing.

• Wormhole: Wormhole switching breaks packets into small flits; a head flit carries the routing information, and the remaining flits follow it through the network in a pipelined manner. Because routers buffer only a few flits rather than entire packets, wormhole switching reduces buffer usage compared to direct virtual cut-through.
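The packet-to-flit decomposition underlying these switching methods can be sketched as follows; the flit width and the HEAD/BODY/TAIL labels are illustrative assumptions, not the dissertation's format:

```python
def packetize(payload: bytes, flit_bytes: int = 4):
    """Split a packet payload into HEAD/BODY/TAIL flits (illustrative).

    The HEAD flit is the one that would carry the routing header; the TAIL
    flit is the one that releases channel resources in wormhole switching.
    """
    chunks = [payload[i:i + flit_bytes]
              for i in range(0, len(payload), flit_bytes)]
    flits = []
    for i, chunk in enumerate(chunks):
        if i == 0:
            kind = "HEAD"
        elif i == len(chunks) - 1:
            kind = "TAIL"
        else:
            kind = "BODY"
        flits.append((kind, chunk))
    return flits

# A 12-byte payload with 4-byte flits yields one HEAD, one BODY, one TAIL.
flits = packetize(b"0123456789AB")
```

Under store-and-forward, a router would buffer all of these flits before forwarding; under cut-through and wormhole, forwarding can start as soon as the HEAD flit has been routed.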
Wormhole switching is widely regarded as the most popular and efficient switching method in NoCs, and it will be the primary switching method discussed throughout the remainder of this dissertation. However, it is important to note that the likelihood of deadlocks, i.e., infinite cyclic dependencies in the network, increases significantly with this method, necessitating further discussion and the exploration of mitigation strategies in subsequent chapters.

1.3.4 Routing

Routing in NoC architectures dictates how packets traverse intermediate routers from a source router to a destination router. Figure 1.5 provides an illustration of an XY routing example, showcasing the routing path taken by data packets through a mesh topology. It also depicts the potential turns resulting from XY routing. For instance, when routing from S1 to D1, the west-north turn is taken, as shown in the figure. However, since YX routing is not implemented, the dashed-red turns are not taken, effectively breaking a cycle. This cycle-breaking mechanism is crucial for avoiding deadlocks, a topic that will be discussed in more detail later in this section.

Routing strategies are categorized based on several criteria. First, routing can be either distributed or centralized. In distributed routing, each router makes decisions locally based on packet information, while in centralized routing, decisions are made for the entire network from specific locations. Additionally, routing can be single-destination or multi-destination, allowing one sender to transmit data to multiple receivers. Furthermore, routing can be oblivious, with no knowledge of network conditions, or adaptive, adjusting routing paths based on network conditions. In adaptive routing, decision-making can occur locally at each node. Routers play a crucial role by determining how to forward inputs to outputs using specific routing algorithms.
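The XY (dimension-order) routing behavior illustrated above can be expressed compactly; the coordinate convention (x grows eastward, y grows northward) and the port names are assumptions for illustration:

```python
def xy_route(cur, dst):
    """Return the output port for one hop of XY routing on a 2D mesh.

    Packets first travel along X (east/west) and only then along Y
    (north/south); forbidding the Y-to-X turns breaks all routing cycles,
    which is what makes XY routing deadlock-free.
    """
    (cx, cy), (dx, dy) = cur, dst
    if cx != dx:                       # X dimension first
        return "EAST" if dx > cx else "WEST"
    if cy != dy:                       # then Y dimension
        return "NORTH" if dy > cy else "SOUTH"
    return "LOCAL"                     # arrived: eject to the local core
```

For example, a packet at (0, 0) heading to (2, 1) is sent EAST until its x-coordinate matches the destination, then NORTH, so a north-to-east (or any Y-to-X) turn can never occur.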
Each router utilizes the routing information contained in the packet header to route the packet to the appropriate output. Despite these mechanisms, routing in NoC architectures can encounter challenges leading to packet loss or delays. The three main concerns in routing are deadlock, livelock, and starvation. Deadlock occurs when packets wait for each other to proceed, resulting in cyclic dependencies and network saturation. Livelock arises when packets circulate without ever reaching their destination. Starvation occurs when packets with lower priorities never reach their destination because resources are always allocated to higher-priority packets. One of the most important aspects of designing a deadlock-free routing algorithm is the need for cycle breaking: cyclic dependencies between router buffers result in infinite waiting between channels and deteriorate network performance. Deadlock mitigation methods are broadly divided into two categories: turn-model-based and virtual-channel-based routing algorithms.

1.3.5 Deadlock recovery

Deadlock recovery in NoC architectures refers to the process of detecting and resolving deadlock situations in which multiple routers are blocked, unable to proceed due to circular dependencies in buffer allocation. While deadlock recovery mechanisms can help revive a system from such a situation, deadlocks can bring communication to a grinding halt, rendering the entire system non-functional. Therefore, it is imperative to prioritize the prevention of deadlock rather than relying solely on recovery strategies.

Figure 1.6: Deadlock-free turn models: (a) west-first turn model, (b) north-last turn model, and (c) negative-first turn model.

Deadlock recovery mechanisms, while useful as a last resort, come with inherent drawbacks.
They introduce complexity and overhead to the system, which can degrade performance and efficiency. Additionally, the time taken for deadlock recovery can lead to significant delays, particularly in time-sensitive applications, impacting system responsiveness and reliability.

Moreover, relying on recovery strategies alone does not address the root cause of deadlock. It merely resolves immediate deadlock situations without guaranteeing the prevention of future occurrences. This can result in a vicious cycle of deadlock detection and recovery, leading to system instability and inefficiency.

Therefore, the focus should be on implementing deadlock-avoidance strategies as a fundamental aspect of NoC design. By prioritizing deadlock avoidance, designers can create more robust and reliable systems that operate seamlessly without the need for frequent recovery interventions. This approach ensures the continuous and uninterrupted flow of data within the NoC, safeguarding system performance and functionality.

1.3.6 Deadlock avoidance

Turn Model

As discussed for XY routing, restricting turns results in deadlock-free routing, since the remaining turns cannot form a cycle. A turn-model algorithm prohibits one or more turns during packet routing, thereby cutting off cyclic dependencies between channels. Three well-known models follow this approach, as shown in Figure 1.6:

Figure 1.7: Deadlock turn model examples: (a) the turns in the right-hand cycle create a cyclic dependency, (b) the combination of the turns in both cycles can create a cyclic dependency, and (c) an example of a combination of the avoided turns in (b) resulting in a deadlock cycle.

• West-first Turn Model: In this turn model, all turns toward the west are prohibited. While this rule breaks any possible cyclic dependency, a packet that moves westward cannot be routed adaptively.
• North-last Turn Model: In this turn model, similarly, a packet that moves northward cannot be routed adaptively. In other words, after moving northward, it cannot turn anymore.

• Negative-first Turn Model: In this turn model, a packet that has moved in a positive direction cannot subsequently turn in a negative direction.

As can be seen, by avoiding only one of the four turns in each cycle, deadlock can be avoided. However, there are cases where avoiding one turn per cycle cannot break the cyclic dependency, as shown in Figure 1.7.

Virtual Channels

Flow control between two routers works such that packets are stored in the input and output buffers of each channel (link), and passage through the physical channel occurs only at certain clock cycles. Since the buffers are FIFO, the packet that enters first is prepared for transmission before the others; therefore, when a packet occupies the buffers of a channel, another packet cannot access the channel even when the physical channel itself is free. In this situation, by adding a buffer alongside the existing buffers of each channel, the physical channel can be shared among multiple virtual channels; compared to adding another physical channel, this saves hardware overhead at the cost of extra buffers.

Figure 1.8: Sub-networks to achieve deadlock freedom while offering fully adaptive routing: (a) adaptive routing from south to north (e.g., S1 to D1), and (b) adaptive routing from north to south (e.g., S2 to D2).

By adding virtual channels, the flow of each virtual channel is separated, and different turn restrictions can be assigned to each virtual channel. In this case, the turn model is examined separately for each virtual channel, and packets can be transferred from one channel to another as long as cyclic dependencies between the virtual channels are not created.
For example, with two virtual channels, a packet may be sent from the first channel to the second, provided that packets from the second channel are never sent back to the first. Therefore, virtual channels not only improve network performance but can also break cyclic dependencies and prevent deadlock, at the cost of the hardware overhead of the additional buffers. For instance, the two sub-networks presented in Figure 1.8 can prevent deadlock and offer fully adaptive routing if one virtual channel is assigned to each sub-network: by following the routing rules (avoided/allowed turns) in each sub-network, deadlock freedom is guaranteed.

1.3.7 3D Networks-on-Chip

Conventional two-dimensional NoCs (2D NoCs) suffer from low scalability in supporting low-latency communication among a large number of on-chip processing cores. To address this challenge, three-dimensional NoCs (3D NoCs) have been proposed to offer high-performance communication for modern computers [5]. With the advances in three-dimensional (3D) integration technologies, systems with stacked dies are interconnected using Through-Silicon Vias (TSVs), further improving NoC scalability, integration density, and system heterogeneity [6–8]. From a network perspective, 3D topologies reduce the average inter-node distance and save energy consumption [5]. However, connecting the layers of 3D chips necessitates the use of inefficient vertical link technologies. The most promising vertical link technology is the TSV, although it introduces high cost and low link reliability. Partially connected NoCs (PCNoCs) have been proposed to alleviate the high cost of vertical links by utilizing a smaller number of TSVs. In 3D NoCs, each vertical link (a.k.a. elevator) includes tens or even hundreds of TSV wires. Therefore, removing some elevators greatly reduces the fabrication cost of the NoC while imposing a small latency overhead.
However, as some links are eliminated, the network topology becomes irregular, which makes the routing process more complex. Elevators are shared among routers, and each router selects an elevator to route packets to another layer.

The challenges of 3D NoCs include high fabrication costs, power density issues, and limited cooling conductivity, leading to thermal hotspots and reliability concerns [9, 10]. To mitigate these challenges, 2.5D NoCs (a.k.a. chiplet systems) offer a compelling alternative. In a 2.5D NoC, chiplets are integrated on an interposer, enabling inter-chiplet communication while maintaining modularity. This approach not only addresses the drawbacks of 3D integration but also improves manufacturing yield by disintegrating large chips into smaller, interconnected chiplets. With the continuous demand for high compute performance and parallelism in data-intensive applications, 2.5D chiplet systems present a scalable solution, allowing for easier integration of pre-designed chiplets and speeding up time-to-market for electronics designs.

1.3.8 2.5D Networks-on-Chip (chiplet systems)

The continuous expansion of data- and compute-intensive applications, such as big data analytics and deep learning, has spurred the development of large-scale chips with heightened compute performance and extensive parallelism. These chips, featuring numerous processing cores ranging from several tens to hundreds, provide the computational prowess necessary for emerging applications but are hindered by low manufacturing yields owing to their substantial chip size.

Figure 1.9: An example of a 2.5D chiplet system with four chiplets connected through an interposer.

In response
to this challenge, 2.5D chiplet systems have emerged, where a large chip is fragmented into several smaller chiplets interconnected via an inter-chiplet interposer network, substantially enhancing the collective manufacturing yield. Illustrated in Figure 1.9, a typical 2.5D chiplet system integrates multiple chiplets on an interposer, facilitating inter-chiplet communication. Each chiplet typically incorporates multiple processing cores interconnected by an intra-chiplet mesh-based NoC. This modular approach allows for the assembly of chiplets into new systems, effectively addressing design challenges, particularly time-to-market considerations, in the electronics domain. Similar to 3D SoCs, such modular integration enables the heterogeneous and independent design of each chiplet. Furthermore, the design cycle for chiplet systems is notably shorter than for traditional monolithic designs, as off-the-shelf chiplets can be readily integrated on an interposer to create chiplet systems with varying computation capabilities tailored for different applications. Consequently, chiplet-based systems can harness diverse manufacturing processes and technologies for each chiplet, amplifying the overall flexibility and efficiency of the system architecture. Given the escalating demands for high bandwidth and low latency in modern computing tasks, particularly in data- and compute-intensive applications, the adoption of optical communication on the interposer emerges as a compelling solution to meet these stringent requirements effectively.
Figure 1.10: Data transmission on a silicon photonic interposer.

1.4 Silicon Photonic Networks-on-Chip

Electronic interconnects in many designs (e.g., designs with long interconnects) result in insufficient energy efficiency, high latency, and low bandwidth. In such conventional designs, energy-efficient operation is limited to distances smaller than 300 mm [11]. Moreover, effects such as the skin effect translate into higher losses, and high wiring density causes additional losses because of the resistance and capacitance of the wires. As a result, conventional electronic interconnects are not a promising solution for the future of NoCs [12]. Silicon photonic communication offers a paradigm shift in on-chip data transfer, particularly advantageous for scenarios demanding high bandwidth and low energy consumption. While conventional metallic-based NoCs suffice for intra-chiplet communication within small-scale designs, silicon photonic solutions excel in inter-chiplet communication, especially on platforms like interposers. Leveraging silicon photonic technology on an interposer enables high-bandwidth communication among chiplets over relatively long distances on the interposer itself. This capability is particularly pertinent in domains such as machine learning accelerators, where vast amounts of data need to be rapidly shuttled between different processing units. Our proposed silicon photonic interposers, detailed in subsequent chapters, aim to address these communication challenges efficiently. In contrast to metallic-based interconnects, silicon photonic communication offers several key advantages.
Firstly, it allows for significantly higher-bandwidth transmission over longer distances while consuming minimal power. This efficiency is particularly crucial in modern computing architectures, where the demand for data transfer rates continues to escalate. Furthermore, the compatibility of silicon photonic components with existing semiconductor fabrication processes facilitates seamless adoption without requiring substantial infrastructure changes. These factors collectively position silicon photonic communication as a compelling solution for future-generation computing systems, especially in scenarios necessitating high-speed, low-latency data exchange among disparate chiplets. Fig. 1.10 shows an example of data transmission between two chiplets placed on a silicon-photonic interposer. On the interposer, modulators, filters, and photodiodes (PDs) perform electro-optical and opto-electrical data conversions. In this dissertation, we consider microring resonator (MR) devices [13, 14] for the modulators and filters, due to their area and power efficiency. We define a gateway as an electronic circuit on a chiplet, which controls the modulators (i.e., writer gateway in Fig. 1.10) and PDs (i.e., reader gateway in Fig. 1.10) on the interposer. Moreover, gateways receive/send data from/to the routers on the same chiplet. As shown in Fig. 1.10, optical signals with different wavelengths are generated by an off-chip laser source (green, red, and blue wavelengths). The optical signals are then coupled to the waveguide on the photonic interposer using an optical fiber and a grating coupler. At the writer gateway, MRs modulate electronic data onto the optical signals and, at the reader gateway, each MR filters its corresponding optical signal to be detected by a PD. Note that each MR device is designed to resonate at (i.e., couple with) a specific wavelength.
As a result, several wavelengths can transmit bits of data at the same time over the same waveguide; this technique is called wavelength-division multiplexing (WDM).

1.5 NoC design challenges in 2.5D and 3D chip technologies

1.5.1 Network latency and traffic distribution

Network latency is one of the primary design challenges for NoCs. As the number of cores and processing elements in a chip increases, so does the amount of data traffic between them. This traffic must be managed efficiently to prevent delays and ensure that each core can access the data it needs in a timely manner.

1.5.2 Fault tolerance

Faults in NoC architectures are categorized into three main types based on their duration and impact: permanent faults, transient faults, and periodic faults. Permanent faults persist in the system until repaired, while transient faults occur under specific conditions and last for a limited period. Periodic faults, such as thermal faults, recur at regular intervals and may render the affected component temporarily unusable until rectified. These faults can affect the performance and reliability of NoC systems, which are susceptible to various failure mechanisms influenced by environmental conditions and process variations. Radiation is a significant source of faults, originating from cosmic neutrons and alpha particles, which can induce Single Event Upsets (SEUs) or Single Event Transients (SETs) in memory elements or logic gates. Electrostatic discharge can break electrical bonds in electronic components, leading to dielectric, PN junction, or interconnect breakdowns, exacerbated by increased heat generation and leakage current. Aging effects, accelerated by heat, can degrade CMOS hardware performance and increase failure rates, particularly in interconnects. Electromigration, driven by high current density, causes thinning of metal contacts over time, leading to timing failures or circuit malfunction.
Heat-induced faults, especially prevalent in submicron technologies and three-dimensional on-chip networks, result from temperature-induced variations in transistor characteristics, affecting speed, power consumption, and reliability. To cope with faults, three main methods are employed: fault avoidance, fault hiding, and fault tolerance. Fault avoidance aims to prevent fault progression through testing and quality-control methods. Fault hiding reduces the severity or number of fault effects, often by incorporating redundancy in the system. Fault tolerance enables the system to continue operating despite faults, achieved through fault masking, detection and relocation, or routing algorithms that minimize exposure to faults. These coping methods are essential for ensuring the reliable performance of NoC architectures in the face of potential faults and failures. Fault-tolerant routing algorithms ensure network reliability in the face of faults, selecting alternative paths to avoid disruptions caused by faults. These algorithms employ techniques such as fault regions and turn models to mitigate the impact of faults on network performance and reliability.

1.5.3 Hardware security

Hardware security focuses on protecting computing systems from physical attacks and vulnerabilities in the underlying hardware components. In this dissertation, we specifically focus on the security of interposers in 2.5D chiplet systems. This is crucial because the interposer is the central communication hub for all chiplets and, if compromised, could bring the entire system down. Security is especially important in chiplet systems compared to traditional monolithic chips for several reasons. First, chiplet systems often integrate components from various sources, some of which might be untrusted. These untrusted chiplets could potentially launch Denial-of-Service (DoS) attacks by flooding the interposer with traffic, making it unavailable for legitimate communication.
Second, chiplet systems rely on vertical links for communication between chiplets stacked on top of the interposer. These vertical links are expensive and critical for system performance, making them prime targets for attackers.

1.5.4 Energy efficiency

Energy efficiency is another critical concern for chip designers, especially for NoCs. As the number of cores on a chip increases, so does the power required to transmit data between them. NoC designers must carefully consider energy consumption when designing communication architectures to ensure optimal performance without excessive power demands. Energy efficiency is paramount in 2.5D chiplet systems. These systems integrate multiple chiplets to overcome the limitations of large-scale chip fabrication. While this approach offers benefits, it also leads to a higher traffic load on the interposer network. To handle this increased traffic, solutions like silicon photonic communication have been proposed for high-bandwidth inter-chiplet communication. However, silicon photonics, despite its advantages, suffers from high power consumption in interposer networks. Therefore, designing energy-efficient 2.5D chiplet systems is critical to minimize power usage and ensure system reliability, even with the increased traffic demands.

1.5.5 Scalability

Scalability is another crucial aspect of NoC design. As the number of cores on a chip, and potentially the number of chiplets in a system, continues to increase, NoCs must be able to adapt and handle this growth. Designers employ various techniques to achieve scalability. One approach is using hierarchical topologies, where smaller networks are combined to form larger ones, such as integrating 2.5D and 3D networks. However, dynamic routing algorithms and reconfigurable architectures that can adjust to changes in the network's configuration must also be designed.
Additionally, NoC virtualization allows combining multiple NoCs into a single virtual network, enabling scaling across multiple chips or even entire systems.

1.6 Dissertation motivation

The motivation for this dissertation stems from the urgent need to tackle key challenges in NoC architectures, specifically focusing on 3D and 2.5D topologies and interposer designs, while addressing latency, energy efficiency, fault tolerance, and security concerns. This research endeavors to overcome the limitations of 3D NoCs, including high fabrication costs and reliability issues related to TSV technology, by introducing an adaptive congestion- and energy-aware elevator-selection algorithm to mitigate latency and enhance energy efficiency. Furthermore, the dissertation aims to enhance reliability and reduce latency in 2.5D chiplet systems through the development of fault-tolerant routing algorithms. The increasing significance of heterogeneous 2.5D integration, facilitating seamless chiplet integration to reduce design time and costs, underscores the need for ensuring the security and reliability of the Network-on-Interposer (NoI). Addressing this imperative, the dissertation also seeks to develop a secure routing algorithm to safeguard chiplet systems against Distributed Denial-of-Service (DDoS) attacks, an aspect yet to be adequately addressed in current research. Additionally, the dissertation aims to contribute novel solutions to further enhance the efficiency of chiplet systems, considering silicon photonic NoIs. The proposed photonic NoI with PCMs for energy-efficient interposer communication aims to mitigate power consumption in interposer-based photonic networks, providing lower latency, reduced power consumption, and enhanced energy minimization compared to existing photonic networks.
Furthermore, 2.5D accelerators require efficient interposer networks for low-latency and energy-efficient communication, facilitating parallel Multiply and Accumulate (MAC) operations.

1.7 Dissertation outline

The dissertation explores various aspects of interconnect architectures for emerging 3D and 2.5D topologies. Chapter 2 presents an adaptive congestion-and-energy-aware elevator-selection framework for Partially Connected 3D NoCs (PC-3DNoCs), proposing an algorithm for efficient elevator selection. Chapter 3 delves into a reliable and deadlock-free routing approach for 2.5D chiplet-based interposer networks, introducing a routing algorithm to ensure reliable communication in active interposers. In Chapter 4, a reconfigurable silicon-photonic 2.5D chiplet network with Phase Change Materials (PCMs) is outlined, emphasizing energy-efficient communication through the ReSiPI architecture. Chapter 5 extends the discussion to machine learning accelerators in 2.5D chiplet platforms, leveraging ReSiPI-based silicon photonic interposers for improved performance. Chapter 6 introduces a non-blocking switch-based silicon photonic interposer network for 2.5D machine learning accelerators, showcasing the SwInt architecture for efficient data transfer. Chapter 7 introduces a multi-objective routing framework designed to secure chiplet systems against Distributed Denial-of-Service (DDoS) attacks. Through these chapters, the dissertation aims to provide comprehensive insights into advanced interconnect technologies and their applications in next-generation chiplet-based systems while outlining avenues for future research. Finally, Chapter 8 summarizes this dissertation and provides some suggestions for future work.

Chapter 2

Adaptive Congestion-and-Energy-Aware Elevator Selection for Partially Connected 3D NoCs

Unlike conventional 2D Networks-on-Chip (NoCs), 3D NoCs offer scalable and energy-efficient on-chip communication.
Vertical die stacking in 3D NoCs is enabled using inter-layer Through-Silicon Via (TSV) links. However, TSV technology suffers from low reliability and high fabrication costs. To mitigate these costs, Partially Connected 3D NoCs (PC-3DNoCs), which use fewer vertical TSV links, have been introduced. However, with fewer vertical links (a.k.a. elevators), elevator-less routers have to send their inter-layer traffic to nearby elevators, increasing the traffic load and congestion at these elevators and potentially reducing performance. Therefore, it is important that elevator-less routers choose elevators that balance the traffic load among the available elevators. To address this problem, this chapter presents an adaptive congestion- and energy-aware elevator-selection algorithm, called AdEle+. AdEle+ employs an offline multi-objective simulated-annealing-based optimization to find good elevator subsets for routers. In addition, during high traffic loads, AdEle+ uses an adaptive, online elevator-selection algorithm to select an elevator from the elevator subset to dynamically manage traffic congestion on elevators. Moreover, in low-congestion circumstances, AdEle+ switches to a minimal-distance selection to improve energy efficiency. Compared to state-of-the-art selection algorithms under various PC-3DNoC configurations and traffic patterns, AdEle+ reduces the average latency by 9.5% on average (and up to 11.2%) while reducing the hardware overhead by 10.1% due to its efficient online selection in the routers.

2.1 Introduction

To create vertical links in TSV-based 3D NoCs, multiple TSVs are grouped into a bundle. However, due to the large TSV interconnect pitch and keep-out-zone requirements [15], these TSV bundles result in large area overhead. Moreover, TSVs are particularly susceptible to electromigration and capacitive crosstalk-induced issues [16, 17]. Therefore, it is very costly to have TSV-based vertical links at every router [6, 7].
To address these challenges, Partially Connected 3D NoCs (PC-3DNoCs) with TSV-based vertical links at only some routers have been proposed [1, 7, 18]. In addition to reduced fabrication costs, the PC-3DNoC paradigm can be utilized to handle missing vertical links due to TSV-based faults [19] and to improve manufacturing yields. Since PC-3DNoCs remove some of the vertical links (a.k.a. elevators), the remaining elevators must be shared among multiple routers. A well-known routing solution in PC-3DNoCs, Elevator-First routing [1], naïvely selects the nearest elevator without considering the overall network traffic, potentially creating traffic hotspots at certain elevators and increasing the network latency [7]. To mitigate the effects of these traffic hotspots, we can balance the traffic across all elevators using an adaptive routing technique that selects elevators based on elevator utilization. For example, Congestion-aware Dynamic elevator Assignment (CDA) [20] uses global traffic information to improve the elevator load distribution during runtime. Elevator-selection algorithms in PC-3DNoCs can be broadly classified into design-time and runtime approaches. For example, in [1, 21], each router is assigned one elevator for all vertical traffic at design time to optimize network performance (e.g., the network latency). On the other hand, online approaches like [22] monitor traffic and select elevators during runtime to help balance the elevator load. Since design-time approaches do most of the calculations offline, only simple look-up tables are required, oftentimes resulting in simpler and faster implementations compared to an online approach. However, these offline approaches [1, 21, 23–25] rely only on the traffic information available during design time and cannot adapt to new runtime traffic scenarios or changing system conditions, e.g., faults. Although online solutions can help adjust to these changes, they can impose large overhead.
For example, in CDA [20, 22], gathering global traffic information from distant routers and determining the selection at runtime impose significant hardware and latency overhead. This chapter addresses the elevator-selection problem in PC-3DNoC routing techniques by developing a novel congestion- and energy-aware adaptive elevator-selection scheme called AdEle+. The main contributions of AdEle+ are summarized in the following:

• We propose AdEle+, a two-stage elevator-selection approach that takes advantage of both design-time and runtime benefits: AdEle+ integrates a design-time elevator-set optimization and a runtime elevator-selection policy to balance traffic with minimal overhead.

• We utilize a Multi-Objective Simulated-Annealing-based optimization algorithm (AMOSA [26]) offline for the design-time elevator-set optimization. AMOSA chooses, for each router, an optimized subset of elevators that will be used during runtime selection to simplify the online elevator selection.

• We develop a low-cost runtime elevator selection that has two modes: one for high traffic loads using local traffic information and a simple modified round-robin technique to improve the elevator load; and another for low traffic loads using a distance-based elevator selection that considers the elevator distance to both source and destination.

• An algorithm is presented to find the optimized threshold for switching between the different runtime elevator-selection modes.

Our simulation results, under different synthetic and real-application traffic patterns with various PC-3DNoC configurations, show the promise of AdEle+ compared to the state-of-the-art.
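The runtime side of these contributions can be illustrated with a minimal sketch. All names, data structures, and the threshold value below are illustrative assumptions for exposition, not the dissertation's actual implementation: a locally observed load level decides between a round-robin mode that skips congested elevators and a distance-based (DB) mode that minimizes the detour through the elevator.

```python
# Minimal sketch of a two-mode online elevator selection in the spirit of
# AdEle+. Class name, threshold, and congestion signals are hypothetical.

def manhattan(a, b):
    """Hop distance between two (x, y) router coordinates."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

class OnlineElevatorSelector:
    def __init__(self, elevator_subset, congestion_threshold=0.5):
        # elevator_subset: (x, y) positions chosen offline (e.g., by AMOSA)
        self.subset = list(elevator_subset)
        self.threshold = congestion_threshold
        self.rr_index = 0  # round-robin pointer for the high-load mode

    def select(self, src, dst, local_load, elevator_busy):
        """Pick an elevator from the offline-optimized subset.

        local_load: locally observed traffic level in [0, 1]
        elevator_busy: dict mapping elevator position -> True if congested
        """
        if local_load > self.threshold:
            # High load: modified round-robin that skips congested elevators
            for _ in range(len(self.subset)):
                e = self.subset[self.rr_index]
                self.rr_index = (self.rr_index + 1) % len(self.subset)
                if not elevator_busy.get(e, False):
                    return e
            return self.subset[self.rr_index]  # all congested: fall back
        # Low load: distance-based (DB) mode, accounting for the elevator's
        # distance to both the source and the destination
        return min(self.subset,
                   key=lambda e: manhattan(src, e) + manhattan(e, dst))

sel = OnlineElevatorSelector([(0, 0), (3, 0), (1, 3)])
# Low load: the elevator closest to the minimal S-to-D path wins
print(sel.select(src=(1, 1), dst=(2, 2), local_load=0.1, elevator_busy={}))  # → (1, 3)
```

Under low load this degenerates to a minimal-detour choice, which is the energy-efficient behavior described above; under high load the round-robin pointer spreads inter-layer traffic over the subset using only local information.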
For instance, AdEle+ improves the network latency by 10%, on average, and by up to 13.9%. AdEle+ also reduces the energy consumption in low-traffic scenarios. Moreover, due to the local traffic monitoring of AdEle+, hardware overhead is reduced compared to CDA selection, in which global traffic information is shared among routers.

Figure 2.1: (a) An example PC-3DNoC with three elevators (e1, e2, and e3). The routing path from S to D based on the Elevator-First algorithm [1] (dotted red line) and the minimal path (solid blue line) are shown. The middle-layer routers are colored based on their Elevator-First selected elevator. (b) Traffic load on each router in the middle layer: the e2 elevator is highly congested because of the inefficient elevator selection in the Elevator-First algorithm (7 out of 16 routers use this elevator router).

The rest of the chapter is organized as follows. We present the background and review some prior related work on PC-3DNoCs in Section 2.2. Section 2.3 discusses the elevator-selection problem and its complexity in PC-3DNoCs. Moreover, it details our proposed technique (AdEle+) and its implementation. In Section 2.4, we explore the parameters of AdEle+ to optimize the elevator selection. In addition, Section 2.4 presents our simulation results, including latency, energy, power, and hardware. Finally, Section 2.5 concludes the chapter.

2.2 Background and Related Work

This section discusses PC-3DNoCs and related challenges in routing, elevator placement, and elevator selection.

2.2.1 Partially Connected 3D Networks-on-Chip

The total number of TSVs highly affects the overall cost of 3D chips [27, 28].
Since each elevator link includes tens to hundreds of TSVs (e.g., 2 × 128 bits in [29]), limiting elevator links to only some routers—a Partially Connected 3D NoC (PC-3DNoC)—can significantly reduce the fabrication cost [1, 6, 18, 30]. For example, a PC-3DNoC with elevators at 25% of the routers significantly reduces the chip fabrication cost with only 18.7% overhead in the network latency, compared to a fully connected 3D NoC [31]. Moreover, PC-3DNoC schemes can be better extended to deal with TSV faults as they are designed to work in a partially connected network with missing TSV links [19]. Note that our work assumes a given PC-3DNoC topology and thus can handle manufacturing TSV faults; considering runtime TSV faults is beyond the scope of this chapter. As shown in Fig. 2.1a, PC-3DNoCs consist of two types of routers: elevator-less routers with five ports (east, south, west, north, and local), and elevator routers that include two additional ports (up and down) for a total of seven ports. As elevator-less routers are not directly connected to the other layers, their inter-layer packets must first route to an elevator router.

2.2.2 Elevator Placement and Routing Algorithms

Similar to elevator selection during routing in PC-3DNoCs (see Section 2.3), elevator placement also plays an important role in network performance. In [31], the authors proposed an elevator placement that minimizes the hop count from all routers to nearby elevators. It is assumed that elevator-less routers utilize their closest elevator in the layer. Under this assumption, an optimized placement is found that minimizes the average router-to-elevator distance. However, during runtime network operation, different elevators might be selected for elevator-less routers, which is not consistent with the assumption in the placement optimization.
To this end, [20] uses a Genetic Algorithm (GA) to place elevators while considering elevator selection to minimize the average distance and load variance. However, these approaches will likely end up with a non-uniform elevator placement after manufacturing faults, which can lead to non-uniform elevator usage. Even with uniform elevator placement, traffic is typically non-uniform, leading to imbalanced elevator utilization. Fortunately, an adaptive elevator-selection algorithm can compensate for these effects and balance the traffic over the elevators.

Figure 2.2: An overview of our proposed elevator-selection algorithm: AdEle+.

Many routing algorithms have been proposed for PC-3DNoCs [1, 6, 18, 19, 24, 30, 32–34]. They mainly differ in their approach to guaranteeing deadlock freedom. For instance, in Elevator-First [1], the most popular deadlock-free routing for PC-3DNoCs, upward-bound packets are routed in one virtual channel while downward-bound packets are routed in another virtual channel to break the cyclic dependency and maintain deadlock freedom. However, Elevator-First is a deterministic routing algorithm and offers little adaptivity in path selection. To this end, LEAD [24] offers an adaptive routing algorithm while guaranteeing deadlock freedom by defining some subnetworks. However, all of these routing algorithms employ a simple selection, where the closest elevator to the source router is used for inter-layer routing.
Although simple, this selection results in an unbalanced traffic load over the elevators and, therefore, unnecessary network congestion and performance loss.

2.2.3 Elevator Selection Algorithms

In PC-3DNoCs, the elevator-selection process can highly impact the traffic distribution across the elevators and, consequently, the performance of the entire network. Unfortunately, elevator selection in PC-3DNoCs has barely been studied in related work. In conventional routing algorithms, usually the closest elevator is selected to route inter-layer packets [1, 32, 33]. However, this selection ignores the elevators' load distribution and can lead to network congestion. This can be especially harmful for PC-3DNoCs with non-uniform elevator placements, a small number of elevators, or a non-uniform traffic distribution. Adaptive elevator-selection techniques have been proposed [6, 18, 30], but they mainly focus on designing fault-tolerant approaches to handle elevator failures. These strategies select the closest non-faulty elevator to the source without considering the elevator's load and congestion. To improve the traffic distribution in PC-3DNoCs, [21] proposed an offline optimized elevator-selection algorithm using the Tabu search algorithm to distribute the load over elevator links and reduce network latency. However, this approach does not consider network energy and cannot capture the dynamics of the runtime network traffic due to its offline nature. In [24], a simple online selection strategy was presented where routers select one elevator randomly. This random selection improves the elevator traffic distribution compared to the closest-elevator selection. However, it increases the average hop count and energy consumption as some packets will have to travel much farther to distant elevators. In [20] and [22], an online Congestion-aware Dynamic elevator Assignment (CDA) was proposed.
CDA selects the elevator that minimizes the sum of the delay and the buffer utilization of the routers between the packet's source and the elevator. However, CDA requires online global information on the routers' delay and buffer utilization, the sharing of which imposes high overhead. Considering the aforementioned efforts, an efficient elevator-selection solution is yet to be realized for PC-3DNoCs. In the rest of this chapter, we propose an adaptive congestion- and energy-aware elevator-selection algorithm (AdEle+) that optimizes a subset of elevators for each router and then leverages these subsets at runtime to dynamically avoid congested elevators and improve traffic while relying only on local information. Employing a set of elevators instead of one elevator per router enables the online selection to adapt to new traffic patterns, hence improving network performance. Moreover, we realize a low-cost Distance-Based (DB) elevator selection for low-congestion scenarios to improve energy efficiency. In addition, an algorithm is proposed for switching between the traffic-aware and distance-based selections to enable AdEle+ to account for a dynamic traffic load. We explore the design parameters of AdEle+ and optimize them to significantly improve performance. We also show the effectiveness of the distance-based elevator selection under low traffic load scenarios and the dynamic switching between the distance-based and congestion-aware selections.

2.3 Proposed Elevator-Selection Algorithm: AdEle+

This section discusses the main challenges for elevator selection in PC-3DNoCs and details our proposed adaptive congestion- and energy-aware elevator-selection algorithm, AdEle+. As an overview, Fig.
2.2 shows the building blocks of the proposed algorithm: AdEle+ uses an offline multi-objective simulated-annealing-based algorithm (AMOSA) to find an optimized subset of elevators for each router and an online elevator-selection algorithm to then select the best elevator from the subset in the presence of runtime traffic.

2.3.1 Motivation: Routing in PC-3DNoCs

In PC-3DNoCs, the routing process requires three main steps because of the irregular topology: 1) selecting a vertical link (elevator) for each packet in the source router and then routing the packet to that elevator on the source layer; 2) vertically routing the packet to the destination layer; and 3) routing the packet from the elevator to the destination node on the destination layer. In PC-3DNoCs, the elevator selection (the first step) is critical as elevators quickly become the network bottleneck due to the small number of elevators. As we will show, AdEle+ optimizes the elevator selection in the first step to balance the traffic over the elevators, thereby reducing network traffic hotspots. Fig. 2.1a shows an example of a PC-3DNoC with three elevators (e1–e3). Router tiles are colored with the color of the elevator they would use under the Elevator-First policy (i.e., the closest elevator to the source router is selected) [1, 6, 18]: four routers will use the green (e1) elevator, seven will use the blue (e2) elevator, and five will use the red (e3) elevator. Unfortunately, such an unbalanced elevator selection can put severe traffic pressure on certain elevators (e2 in this example). Ideally, some of the load on e2 should be assigned to e1 or e3, making e2 less congested. Fig. 2.1b demonstrates the utilization of the middle-layer routers with the Elevator-First selection policy under uniform traffic. This confirms that e2 is highly congested due to the unbalanced elevator selection.
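The imbalance produced by a closest-elevator policy can be reproduced in a few lines. The elevator coordinates below are illustrative assumptions, not the exact layout of Fig. 2.1; the point is that nearest-elevator assignment on a 4×4 layer easily overloads one elevator.

```python
# Sketch: count how many routers in a 4x4 layer pick each elevator under a
# closest-elevator policy. Elevator positions are illustrative assumptions.
from collections import Counter

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

elevators = {"e1": (0, 0), "e2": (1, 2), "e3": (3, 3)}
routers = [(x, y) for x in range(4) for y in range(4)]

# Each router picks its nearest elevator (ties broken by dict order)
assignment = {
    r: min(elevators, key=lambda e: manhattan(r, elevators[e])) for r in routers
}
load = Counter(assignment.values())
print(load)  # with these positions, e2 ends up the most loaded elevator
```

With these hypothetical positions, 7 of the 16 routers happen to select e2, mirroring the imbalance the chapter describes for Elevator-First: under uniform traffic, e2 would carry the largest share of inter-layer load even though the placement looks reasonable.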
In terms of energy efficiency in low traffic loads, the best elevator selection is on the minimal path between the source and destination. However, for the path between S and D in Fig. 2.1a, policies that ignore the location of