Mixed-Precision S/DGEMM Using the TF32 and TF64 Frameworks on Low-Precision AI Tensor Cores

Pedro Valero-Lara, Frank Liu, and Jeffrey S. Vetter
Oak Ridge National Laboratory, Oak Ridge, USA
{valerolarap,liufy,vetter}@ornl.gov

Ian Jorquera
Colorado State University, Fort Collins, USA
jorquera@colostate.edu

ABSTRACT
Using NVIDIA graphics processing units (GPUs) equipped with Tensor Cores has enabled the significant acceleration of general matrix multiplication (GEMM) for applications in machine learning (ML) and artificial intelligence (AI) and in high-performance computing (HPC) generally. The use of such power-efficient, specialized accelerators can provide a performance increase between 8× and 20×, albeit with a loss in precision. However, a high level of precision is required in many large scientific and HPC applications, and computing in single or double precision is still necessary for many of these applications to maintain accuracy. Fortunately, mixed-precision methods can be employed to maintain a higher level of numerical precision while also taking advantage of the performance increases from computing with lower-precision AI cores. With this in mind, we extend the state of the art by using NVIDIA's new TF32 framework. This new framework not only lifts some constraints of the previous frameworks, such as costly 32-bit to 16-bit castings, but also provides equivalent precision and performance by using a much simpler approach. We also propose a new framework called TF64 that attempts double-precision arithmetic with low-precision Tensor Cores. Although this framework does not exist yet, we validated the correctness of this idea and achieved an equivalent of 64-bit precision on 32-bit hardware.

CCS CONCEPTS
• Hardware → Hardware test; Analysis and design of emerging devices and systems; • Mathematics of computing → Numerical analysis.

KEYWORDS
Mixed Precision, Tensor Core, GEMM, GPUs

ACM Reference Format:
Pedro Valero-Lara, Frank Liu, and Jeffrey S.
Vetter and Ian Jorquera. 2023. Mixed-Precision S/DGEMM Using the TF32 and TF64 Frameworks on Low-Precision AI Tensor Cores. In Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis (SC-W 2023), November 12–17, 2023, Denver, CO, USA. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3624062.3624084

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
SC-W 2023, November 12–17, 2023, Denver, CO, USA
© 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 979-8-4007-0785-8/23/11. . . $15.00
https://doi.org/10.1145/3624062.3624084

1 INTRODUCTION
Recently, graphics processing unit (GPU) and central processing unit (CPU) vendors have become focused on accelerating general matrix multiplication (GEMM) with low-precision AI cores that are now available on newer GPUs, including NVIDIA's Tensor Cores [19], AMD's Matrix Cores [2], and ARM's SME [4], among other specialized architectures [3, 31]. Accelerators equipped with these lower-precision cores outperform traditional CPUs/GPUs in artificial intelligence (AI) or machine learning (ML) workloads (e.g., low- or mixed-precision arithmetic) and are more power efficient [6]. These specialized hardware components essentially compute a matrix-matrix multiplication using low-precision operands, which is the principal component of multiple AI and ML applications [15, 16]. GEMM is defined as the operation C ← αAB + βC, where A is an m×k matrix, B is a k×n matrix, and C is an m×n matrix [18, 40].
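For reference, the GEMM operation defined above can be written as a minimal plain-Python sketch. This is our own illustration of the definition, not the paper's GPU kernels, and the function name `gemm` is ours:

```python
def gemm(alpha, A, B, beta, C):
    """Reference GEMM: C <- alpha*A*B + beta*C, updating C in place.

    A is m x k, B is k x n, and C is m x n, each stored as a list of rows.
    """
    m, k, n = len(A), len(B), len(B[0])
    assert all(len(row) == k for row in A) and len(C) == m
    for i in range(m):
        for j in range(n):
            # Inner product of row i of A with column j of B.
            acc = 0.0
            for p in range(k):
                acc += A[i][p] * B[p][j]
            C[i][j] = alpha * acc + beta * C[i][j]
    return C
```

For example, `gemm(1.0, [[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]], 0.0, [[0.0, 0.0], [0.0, 0.0]])` returns `[[19.0, 22.0], [43.0, 50.0]]`.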
The matrices A, B, and C and the constants α and β have entries stored as floating-point values, generally following the IEEE formats for half precision (FP16-HGEMM), single precision (FP32-SGEMM), or double precision (FP64-DGEMM). Although these components are designed to accelerate AI/ML applications, they can also be used to accelerate other kinds of operations, including the fast Fourier transform (FFT) [35], linear algebra operations [5, 12, 25], and comparative genomics (the first application to obtain an exaflop) [26], among many others [14, 22, 24, 37]. Given the high performance achieved by this specialized hardware and the importance of GEMM for multiple math libraries and applications [8, 9, 28, 29, 38, 39], the High-Performance Linpack (HPL) benchmark [36], the standard high-performance computing (HPC) benchmark used for the TOP500 list [17], was adapted to make use of these accelerators (i.e., HPL-AI) [21]. This work extends the analysis conducted by M. Fasi et al. [10] by using the new TF32 format. In that work, the authors studied the use of multiword arithmetic and the FP16 formats on NVIDIA Tensor Cores for single-precision operations. The contribution of this work is twofold: (1) the evaluation of the correctness and performance of the TF32 format, a new format that enables fast, low-precision matrix-matrix computation with no expensive 32-bit to 16-bit transformations and that requires no additional, costly operations to guarantee correctness/precision for single-precision operations; and (2) the proposal and study of a new format called TF64 for high-performance, double-precision Tensor Core–accelerated applications.
The rest of the paper is organized as follows: Section 2 introduces the main characteristics of the Tensor Core architecture and reviews both the methodologies used to obtain high precision on low-precision hardware (multiword arithmetic) and the different novel and well-known formats that we can use on these AI accelerators. We evaluate the correctness of the techniques and formats used in this study in Section 3. Related work is presented in Section 4. Finally, conclusions and future directions are outlined in Section 5.

2 MIXED-PRECISION S/DGEMM USING TF32 AND TF64 TENSOR CORE FRAMEWORKS

2.1 Tensor Cores
Tensor Cores [19] are specialized accelerator cores that compute tensors, mathematical objects that describe the relationships between other linked mathematical objects. Each Tensor Core multiplies two 4 × 4 matrices and adds the result to a 4 × 4 accumulator matrix. Novel methodologies have been introduced to improve performance by increasing the size of such accelerators. As Tensor Cores have been developed further, they have extended their capabilities to operate on different formats (Figure 1), including low- and mixed-precision arithmetic, which can accelerate AI operations. Table 1 lists the performance of Tensor Cores for different NVIDIA GPU generations.

Table 1: Tensor Core performance (TFLOPS) on different generations and formats

              FP64   TF32   BF16    FP16    INT8
Volta V100    -      -      -       128     -
Ampere A100   19.5   156    312     312     624
Hopper H100   67     989    1,979   1,979   3,958

Note that V100, A100, and H100 GPUs provide a wide range of different arithmetics with varying levels of performance.
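The per-step primitive described above, multiplying two 4 × 4 tiles and adding the result into a 4 × 4 accumulator, can be sketched in plain Python. This is our own illustration (on real hardware the A and B tiles are low precision and the accumulation is done in a wider format); the function name `mma_4x4` is ours:

```python
def mma_4x4(A, B, C):
    """One Tensor Core step (sketch): D = A*B + C on 4x4 tiles.

    A, B, and C are 4x4 matrices stored as lists of rows; the 4x4
    result D is returned without modifying C.
    """
    assert len(A) == len(B) == len(C) == 4
    return [[C[i][j] + sum(A[i][p] * B[p][j] for p in range(4))
             for j in range(4)]
            for i in range(4)]
```

Larger GEMMs are tiled so that each 4 × 4 block of C is built by repeatedly feeding partial products through this fused multiply-add step.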
The third-generation Tensor Cores that equip the Ampere A100 architecture provide more levels of precision than the Tensor Cores available on its predecessors. Also, different arithmetics have different numbers of computational units on the different NVIDIA GPU architectures, which influences the final throughput. It is important to note that all these new low-precision formats (FP16, TF32, etc.) are not equivalent to IEEE-standard floating-point arithmetic. Also, operations on those formats may not satisfy many of the IEEE requirements for correct and optimal rounding modes.

2.2 Multiword Arithmetic
Throughout this study, we apply multiword arithmetic [10], which stores a high-precision floating-point value as the sum of lower-precision floating-point values. For example, for TF32, let A and B be n × n matrices with entries stored as FP32. Define A1 := (TF32)A as the matrix A, the elements of which have been formatted to TF32, and define A2 := (TF32)(A − A1) as the matrix that stores the values lost due to the conversion. This gives us A ≈ A1 + A2. Similarly, we define B1 := (TF32)B and B2 := (TF32)(B − B1) in the same way. With the approximations A ≈ A1 + A2 and B ≈ B1 + B2, we can compute the GEMM using the following approximation:

A · B ≈ (A1 + A2) · (B1 + B2) = A1B1 + A1B2 + A2B1 + A2B2

Here, we can approximate a single-precision general matrix multiplication (SGEMM) by using four independent matrix multiplications, each computed using the TF32 compute mode. Notably, the final product plays a relatively insignificant role in improving precision, and this allows us to accelerate this method by removing the final A2B2 GEMM. To better understand the operations conducted for multiword arithmetic, we include a simple pseudocode in Figure 2 to illustrate the operations required for this analysis.

2.3 TF32
The TF32 format (Figure 3) adopts 8 exponent bits, 10 bits of mantissa, and 1 sign bit.
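This bit layout can be emulated in software. The sketch below is our own illustration: it keeps FP32's sign and 8 exponent bits and the 10 most significant of the 23 FP32 mantissa bits, using truncation for simplicity (the hardware's actual rounding mode may differ):

```python
import struct

def to_tf32(x):
    """Emulate TF32 precision: truncate the 23-bit FP32 mantissa to 10 bits.

    The sign bit and the 8 exponent bits are kept, so the representable
    range is that of FP32; only mantissa precision is reduced.
    """
    bits = struct.unpack('<I', struct.pack('<f', x))[0]  # FP32 bit pattern
    bits &= 0xFFFFE000                                   # clear the low 13 mantissa bits
    return struct.unpack('<f', struct.pack('<I', bits))[0]
```

For example, `to_tf32(1 + 2**-10)` is representable and survives unchanged, whereas `to_tf32(1 + 2**-11)` collapses to `1.0`, since the 11th mantissa bit is discarded.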
This new format covers the same range of values as FP32 and maintains more precision than BF16 and the same amount as FP16. TF32 provides enough precision margin for AI applications. The TF32 mixed-precision framework (Figure 1) for GEMM takes as input two matrices with entries in single precision (FP32). It then converts them to TF32 and computes the multiplications in full precision. Finally, the accumulation or addition is computed in FP32 to limit any accumulation error [7]. The TF32 mixed-precision framework on the NVIDIA A100 can achieve 156 TFLOPS of peak performance, whereas standard FP32 (non-Tensor Core) achieves 19.5 TFLOPS of peak performance. The result is an 8× performance uplift when using TF32.

2.4 TF64
We also extend this work into double precision. First, we propose a new framework called TF64 (also illustrated in Figure 1) that extends the TF32 mixed-precision framework to double precision; that is, the FP64 inputs are converted to FP32 (or a potential future TF64). Multiplication is then computed in full precision, and accumulation is computed in FP64. With adequate hardware support, we assume that many of the same performance increases seen in the TF32 compute mode would also manifest in these potential double-precision frameworks. Although these frameworks do not exist on Tensor Cores, we find value in understanding the potential benefits of mixed-precision methods on these higher-precision frameworks for scientific applications. The method presented by M. Fasi et al. [11] naturally extends to the new TF64 Tensor Core framework. For n × n matrices A and B with entries stored as FP64, we define A1 := (FP32)A, A2 := (FP32)(A − A1), B1 := (FP32)B, and B2 := (FP32)(B − B1). This gives us the approximations A ≈ A1 + A2 and B ≈ B1 + B2, meaning we can approximate a double-precision GEMM, A · B, with the sum of four FP32 products, A1B1 + A1B2 + A2B1 + A2B2.
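The double-word decomposition used by both the TF32 and TF64 schemes can be demonstrated on scalars with a software FP32 emulation. The sketch below is our own illustration of the idea (the helper names `fp32`, `split`, and `multiword_mul` are ours): an FP64 value is split into two FP32 words, the partial products are formed, and the sum is accumulated in FP64:

```python
import struct

def fp32(x):
    """Round a Python float (FP64) to the nearest FP32 value (emulated cast)."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

def split(a):
    """Two-word split: a ~= a1 + a2, with both words representable in FP32."""
    a1 = fp32(a)        # high word: FP32 rounding of a
    a2 = fp32(a - a1)   # low word: FP32 rounding of the residual
    return a1, a2

def multiword_mul(a, b, terms=4):
    """Approximate a*b from FP32 words, accumulating the partials in FP64.

    terms=3 drops the insignificant low-by-low product, mirroring the
    3x variant discussed in the text.
    """
    a1, a2 = split(a)
    b1, b2 = split(b)
    partials = [a1 * b1, a1 * b2, a2 * b1, a2 * b2]
    return sum(partials[:terms])
```

On values such as 1/3 and 1/7, the three-term sum recovers far more accuracy than the single product of the rounded FP32 words, illustrating why three GEMMs suffice in practice.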
3 ANALYSIS
For the analysis, cuBLAS mixed-precision kernels (cublasGemmEx) were computed on NVIDIA's A100 GPU, and kernels labeled as TF64 were computed through software on an Intel Xeon E5-2698 v4 CPU to mimic possible future computing frameworks.

Figure 1: Current and proposed (TF64) Tensor Core frameworks. In bold are the frameworks evaluated in this study.

float A, A1, A2, B, B1, B2, C_TF32, C;
// Note that for TF32 and TF64, costly castings are not necessary
A1 = A;
A2 = (A - A1);
B1 = B;
B2 = (B - B1);
C_TF32 = C;
cublasGemmEx(A1, B1, C_TF32);
cublasGemmEx(A1, B2, C_TF32);
cublasGemmEx(A2, B1, C_TF32);
// Note that this line is only added for the analysis of correctness
cublasGemmEx(A2, B2, C_TF32);
cublasSgemm(A, B, C);
errorComputation(C_TF32, C);

Figure 2: Example of multiword arithmetic code for TF32.

3.1 Error Computation
As others have done in related work (Section 4), we compute the ℓ2 forward error¹ on the matrix multiplication C = AB for random matrices A, B, and C. The ℓ2 forward error is defined as

ℓ2 error = ||Ĉ − C||₂ / (||A||₂ ||B||₂),

where Ĉ is the mixed-precision result, C is the matrix product computed in double or quad precision (FP128 uses the quad math GNU library [1]) for comparison with the TF32 and TF64 mixed-precision frameworks, respectively, and || · ||₂ is the ℓ2 norm. For the sake of completeness [34], we also considered other errors, such as ℓ1:

ℓ1 error = ||Ĉ − C||₁ / (||A||₁ ||B||₁)

and ℓ∞:

ℓ∞ error = ||Ĉ − C||∞ / (||A||∞ ||B||∞).

¹ https://netlib.org/lapack/lug/node75.html

3.2 SGEMM and TF32
First, we look at the TF32 mixed-precision framework with existing hardware support on the A100 (Figure 4). The 1×TF32 mixed-precision framework for GEMM can provide an improvement of about 4× over the 1×FP16 GEMM. However, this improvement is not enough and yields an error on the order of 1e−5 to 1e−6.
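The forward-error metric of Section 3.1 can be sketched in plain Python. This is our own illustration, not the paper's evaluation code, and we use the entrywise (Frobenius) ℓ2 norm as the matrix norm:

```python
def l2_norm(M):
    """Entrywise (Frobenius) l2 norm of a matrix stored as a list of rows."""
    return sum(x * x for row in M for x in row) ** 0.5

def forward_error(C_hat, C_ref, A, B):
    """l2 forward error: ||C_hat - C_ref||_2 / (||A||_2 * ||B||_2).

    C_hat is the mixed-precision product and C_ref is the reference
    product computed in higher precision.
    """
    diff = [[x - y for x, y in zip(row_hat, row_ref)]
            for row_hat, row_ref in zip(C_hat, C_ref)]
    return l2_norm(diff) / (l2_norm(A) * l2_norm(B))
```

Normalizing by ||A||₂ ||B||₂ makes the error comparable across matrix sizes and magnitudes, which is what allows the TF32 and SGEMM curves to be plotted on common axes.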
Using the multiword arithmetic and 4×TF32 GEMMs, we achieve an error close to that of SGEMM, with magnitudes of error much lower than those of 1×TF32 GEMM or 1×FP16 GEMM. Furthermore, the precision lost by omitting the fourth matrix multiplication has no noticeable effect on the overall error. Additionally, we have included the |SGEMM-3×TF32| error, as shown on the right side of the graphs. This shows that we can provide a close approximation to SGEMM when using 3×TF32 GEMMs. Based on the performance of the NVIDIA A100 GPUs, we obtain a peak performance of about 53 Tflop/s and a performance uplift of 2.6× over SGEMM when using 3×TF32 GEMMs. This analysis extends the work of M. Fasi et al. [10] by using the new TF32 Tensor Core framework. By using this framework, no costly 32-bit to 16-bit casting or extra memory is necessary to benefit from Tensor Core accelerators. The casting is computed at the hardware level (Figure 1) during the execution of the operations instead of at the software level (Figure 2). We can also provide equivalent or even better precision by using the simplest multiword arithmetic algorithm variant; no complex variants or tuning are necessary to improve precision. In terms of performance, the simplicity of the algorithm and code used provides equivalent or even faster performance despite not using the higher-performing FP16 framework (Table 1).

3.3 DGEMM and TF64
We propose another new Tensor Core framework to pursue high performance for double-precision HPC applications on low-precision AI accelerators. Although this framework does not exist today, we believe these methods will be effective in accelerating matrix multiplication for scientific applications while maintaining a high level of precision.
Also, given the current trend in this architecture, with the important addition of new and more precise data formats in the last few years, we can expect to have something similar to what we are proposing in the near future.

Figure 3: Comparison of bit layout (sign bit, exponent, and fraction) for FP64, FP32, FP16, BF16, and TF32 [19].

As expected (Figure 5), we see that this new framework can provide more precision than SGEMM. However, as in our previous comparison of 1×TF32 GEMM vs. HGEMM, this improvement does not meet the requirements for double-precision results. When using 3×TF64 GEMM, however, we achieve an error equivalent in magnitude to that of DGEMM. In this case, there is a noticeable effect from omitting the fourth multiplication, particularly in relatively small matrices, but it remains insignificant when compared with the precision gained from the first three matrix multiplications. Overall, we see that approximating DGEMM with 3×TF64 GEMM provides an error reduction on the order of 10^7 relative to 1×TF64. As in the previous analysis, we included the |DGEMM-3×TF64| error, as illustrated on the right side of the graphs (Figure 5).

4 RELATED WORK
Besides the NVIDIA Tensor Cores discussed in this paper, several companies are also employing and developing specialized hardware for high-performance inference, such as AMD [2], ARM [4], Intel, and Cerebras [3, 31]. But not only hardware vendors are designing AI accelerators: Movidius developed the Myriad 2 Vision Processing Unit [23], and Google designed and developed a Tensor Processing Unit (TPU) specifically for inference workloads [33]. We can find some examples of using multiword arithmetic on low-precision hardware, including the work of Markidis et al. [27], who call this technique precision refinement.
Similar approaches can be found for FFT [32, 35], although most of the state-of-the-art references focus on matrix-matrix multiplication [27, 30]. Also, as in our work, these techniques were used to propose new ideas for more efficient and faster hardware [13] (e.g., block fused multiply-add units on future Intel hardware). Iterative refinement solvers [12] are another successful example of using AI accelerators in HPC. This kind of algorithm can effectively use the low-precision AI Tensor Cores and reach the high precision required by HPC applications. Indeed, these were the techniques implemented for HPL-AI [20, 21] to reach the exascale by using AI accelerators. However, these algorithms require computing the same kind of operations repeatedly because of their iterative nature, which may reduce the potential performance benefit of using Tensor Core-like processors. In general, the number of iterations depends on the coefficients of the matrix to be computed, in other words, on its condition number [12]. In fact, in certain cases, it can be difficult to reach a good precision [12]. Unlike iterative solvers, direct solvers can alleviate such limitations in terms of both performance and precision. Indeed, the work presented in this paper can be used in such direct solvers [38]. As mentioned, this work extends the previously published work by M. Fasi et al. [10] by using the new TF32 format. Also, as far as we know, ours is the first work to propose and analyze a new framework (TF64) for double-precision operations on Tensor Cores.

5 CONCLUSIONS AND FUTURE DIRECTIONS
We extended the state of the art by using the new TF32 Tensor Core format to provide accurate solutions with relatively simple techniques based on multiword arithmetic, all without costly 32-bit to 16-bit software-level castings.
Also, the achieved performance is equivalent to or higher than previous solutions based on the FP16 Tensor Core framework. This work also proposed a novel Tensor Core framework called TF64 for double precision and demonstrated the potential effectiveness of mixed-precision Tensor Cores to accelerate double-precision GEMM for scientific and HPC applications. More analyses are required to study new TF64 formats that use even lower-precision formats with fewer exponent and fraction bits. Further work is also needed to expand this analysis to other AI core hardware, such as AMD Matrix Cores, ARM SME, and others.

Figure 4: SGEMM and TF32 mixed-precision error analysis.

Figure 5: DGEMM and TF64 mixed-precision error analysis.

ACKNOWLEDGMENTS
This research used resources of the Oak Ridge Leadership Computing Facility and the Experimental Computing Laboratory at the Oak Ridge National Laboratory, which is supported by DOE's Office of Science under Contract No. DE-AC05-00OR22725. This research was supported in part by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the DOE's Office of Science and the National Nuclear Security Administration. This manuscript has been authored by UT-Battelle LLC under Contract No. DE-AC05-00OR22725 with the DOE. The publisher, by accepting the article for publication, acknowledges that the US Government retains a non-exclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of the manuscript or allow others to do so, for US Government purposes.
The DOE will provide public access to these results in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

REFERENCES
[1] 2018. The GCC Quad-Precision Math Library. https://gcc.gnu.org/onlinedocs/gcc-8.2.0/libquadmath.pdf [Online; accessed 30-May-2023].
[2] 2021. AMD Instinct MI250X Accelerator. https://www.amd.com/en/products/server-accelerators/instinct-mi250x [Online; accessed 30-May-2023].
[3] 2022. Cerebras. https://www.cerebras.net/ [Online; accessed 30-May-2023].
[4] 2022. The Scalable Matrix Extension (SME), for Armv9-A. https://developer.arm.com/documentation/ddi0616/latest [Online; accessed 30-May-2023].
[5] Ahmad Abdelfattah, Hartwig Anzt, Erik G. Boman, Erin C. Carson, Terry Cojean, Jack J. Dongarra, Alyson Fox, Mark Gates, Nicholas J. Higham, Xiaoye S. Li, Jennifer A. Loe, Piotr Luszczek, Srikara Pranesh, Siva Rajamanickam, Tobias Ribizel, Barry F. Smith, Kasia Swirydowicz, Stephen J. Thomas, Stanimire Tomov, Yaohung M. Tsai, and Ulrike Meier Yang. 2021. A survey of numerical linear algebra methods utilizing mixed-precision arithmetic. Int. J. High Perform. Comput. Appl. 35, 4 (2021). https://doi.org/10.1177/10943420211003313
[6] Ehsan Atoofian. 2023. PTTS: Power-aware tensor cores using two-sided sparsity. J. Parallel Distributed Comput. 173 (2023), 70–82. https://doi.org/10.1016/j.jpdc.2022.11.004
[7] Pierre Blanchard, Nicholas J. Higham, Florent Lopez, Theo Mary, and Srikara Pranesh. 2020. Mixed Precision Block Fused Multiply-Add: Error Analysis and Application to GPU Tensor Cores. SIAM Journal on Scientific Computing 42 (2020). https://doi.org/10.1137/19M1289546
[8] Jack J. Dongarra, Mark Gates, Azzam Haidar, Jakub Kurzak, Piotr Luszczek, Panruo Wu, Ichitaro Yamazaki, Asim YarKhan, Maksims Abalenkovs, Negin Bagherpour, Sven Hammarling, Jakub Sístek, David Stevens, Mawussi Zounon, and Samuel D. Relton. 2019. PLASMA: Parallel Linear Algebra Software for Multicore Using OpenMP. ACM Trans. Math.
Softw. 45, 2 (2019), 16:1–16:35. https://doi.org/10.1145/3264491
[9] Mohammed A. Al Farhan, Ahmad Abdelfattah, Stanimire Tomov, Mark Gates, Dalal Sukkari, Azzam Haidar, Robert Rosenberg, and Jack J. Dongarra. 2020. MAGMA templates for scalable linear algebra on emerging architectures. Int. J. High Perform. Comput. Appl. 34, 6 (2020). https://doi.org/10.1177/1094342020938421
[10] Massimiliano Fasi, Nicholas J. Higham, Florent Lopez, Théo Mary, and Mantas Mikaitis. 2023. Matrix Multiplication in Multiword Arithmetic: Error Analysis and Application to GPU Tensor Cores. SIAM J. Sci. Comput. 45, 1 (2023), 1. https://doi.org/10.1137/21m1465032
[11] Massimiliano Fasi, Nicholas J. Higham, Florent Lopez, Theo Mary, and Mantas Mikaitis. 2023. Matrix Multiplication in Multiword Arithmetic: Error Analysis and Application to GPU Tensor Cores. (2023). http://eprints.maths.manchester.ac.uk/id/eprint/286 MIMS Preprint (submitted).
[12] Azzam Haidar, Stanimire Tomov, Jack J. Dongarra, and Nicholas J. Higham. 2018. Harnessing GPU tensor cores for fast FP16 arithmetic to speed up mixed-precision iterative refinement solvers. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018, Dallas, TX, USA, November 11-16, 2018. IEEE/ACM, 47:1–47:11. http://dl.acm.org/citation.cfm?id=3291719
[13] Greg Henry, Ping Tak Peter Tang, and Alexander Heinecke. 2019. Leveraging the bfloat16 Artificial Intelligence Datatype For Higher-Precision Computations. In 26th IEEE Symposium on Computer Arithmetic, ARITH 2019, Kyoto, Japan, June 10-12, 2019, Naofumi Takagi, Sylvie Boldo, and Martin Langhammer (Eds.). IEEE, 69–76. https://doi.org/10.1109/ARITH.2019.00019
[14] Zhuoran Ji and Cho-Li Wang. 2022. Efficient exact K-nearest neighbor graph construction for billion-scale datasets using GPUs with tensor cores. In ICS '22: 2022 International Conference on Supercomputing, Virtual Event, June 28-30, 2022, Lawrence Rauchwerger, Kirk W.
Cameron, Dimitrios S. Nikolopoulos, and Dionisios N. Pnevmatikatos (Eds.). ACM, 10:1–10:12. https://doi.org/10.1145/3524059.3532368
[15] Marc Jordà, Pedro Valero-Lara, and Antonio J. Peña. 2019. Performance Evaluation of cuDNN Convolution Algorithms on NVIDIA Volta GPUs. IEEE Access 7 (2019), 70461–70473. https://doi.org/10.1109/ACCESS.2019.2918851
[16] Marc Jordà, Pedro Valero-Lara, and Antonio J. Peña. 2022. cuConv: CUDA implementation of convolution for CNN inference. Clust. Comput. 25, 2 (2022), 1459–1473. https://doi.org/10.1007/s10586-021-03494-y
[17] Awais Khan, Hyogi Sim, Sudharshan S. Vazhkudai, Ali Raza Butt, and Youngjae Kim. 2021. An Analysis of System Balance and Architectural Trends Based on Top500 Supercomputers. In HPC Asia 2021: The International Conference on High Performance Computing in Asia-Pacific Region, Virtual Event, Republic of Korea, January 20-21, 2021, Soonwook Hwang and Heon Young Yeom (Eds.). ACM, 11–22. https://doi.org/10.1145/3432261.3432263
[18] Hyeonjin Kim and William J. Song. 2023. LAS: Locality-Aware Scheduling for GEMM-Accelerated Convolutions in GPUs. IEEE Trans. Parallel Distributed Syst. 34, 5 (2023), 1479–1494. https://doi.org/10.1109/TPDS.2023.3247808
[19] Ronny Krashinsky, Olivier Giroux, Stephen Jones, Nick Stam, and Sridhar Ramaswamy. 2023. NVIDIA Ampere Architecture In-Depth. https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth/ [Online; accessed 30-May-2023].
[20] Shuhei Kudo, Keigo Nitadori, Takuya Ina, and Toshiyuki Imamura. 2020. Implementation and Numerical Techniques for One EFlop/s HPL-AI Benchmark on Fugaku. In 11th IEEE/ACM Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA@SC 2020, Atlanta, GA, USA, November 13, 2020. IEEE, 69–76. https://doi.org/10.1109/ScalA51936.2020.00014
[21] Shuhei Kudo, Keigo Nitadori, Takuya Ina, and Toshiyuki Imamura. 2020. Prompt Report on Exa-Scale HPL-AI Benchmark.
In IEEE International Conference on Cluster Computing, CLUSTER 2020, Kobe, Japan, September 14-17, 2020. IEEE, 418–419. https://doi.org/10.1109/CLUSTER49012.2020.00058
[22] Wai-Kong Lee, Hwajeong Seo, Zhenfei Zhang, and Seong Oun Hwang. 2022. TensorCrypto: High Throughput Acceleration of Lattice-Based Cryptography Using Tensor Core on GPU. IEEE Access 10 (2022), 20616–20632. https://doi.org/10.1109/ACCESS.2022.3152217
[23] Vasileios Leon, Kiamal Z. Pekmestzi, and Dimitrios Soudris. 2022. Systematic Embedded Development and Implementation Techniques on Intel Myriad VPUs. In 30th IFIP/IEEE International Conference on Very Large Scale Integration, VLSI-SoC 2022, Patras, Greece, October 3-5, 2022. IEEE, 1–2. https://doi.org/10.1109/VLSI-SoC54400.2022.9939592
[24] Shigang Li, Kazuki Osawa, and Torsten Hoefler. 2022. Efficient Quantized Sparse Matrix Operations on Tensor Cores. CoRR abs/2209.06979 (2022). https://doi.org/10.48550/arXiv.2209.06979 arXiv:2209.06979
[25] Neil Lindquist, Piotr Luszczek, and Jack J. Dongarra. 2022. Accelerating Restarted GMRES With Mixed Precision Arithmetic. IEEE Trans. Parallel Distributed Syst. 33, 4 (2022), 1027–1037. https://doi.org/10.1109/TPDS.2021.3090757
[26] Lixiang Luo, Tjerk P. Straatsma, Luis Enrique Aguilar-Suárez, Ria Broer, Dmytro Bykov, Eduardo F. D'Azevedo, Shirin S. Faraji, Kalyana C. Gottiparthi, Coen de Graaf, James Austin Harris, Remco W. A. Havenith, Hans Jørgen Aagard Jensen, Wayne Joubert, R. K. Kathir, Jeff Larkin, Ying Wai Li, Dmitry I. Lyakh, O. E. Bronson Messer, Matthew R. Norman, Joseph C. Oefelein, Ramanan Sankaran, Andreas F. Tillack, Ashleigh L. Barnes, Lucas Visscher, Jack C. Wells, and Meilani Wibowo. 2020. Pre-exascale accelerated application development: The ORNL Summit experience. IBM J. Res. Dev. 64, 3/4 (2020), 11:1–11:21. https://doi.org/10.1147/JRD.2020.2965881
[27] Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, and Jeffrey S. Vetter. 2018.
NVIDIA Tensor Core Programmability, Performance & Precision. In 2018 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPS Workshops 2018, Vancouver, BC, Canada, May 21-25, 2018. IEEE Computer Society, 522–531. https://doi.org/10.1109/IPDPSW.2018.00091
[28] Narasinga Rao Miniskar, Mohammad Alaul Haque Monil, Pedro Valero-Lara, Frankie Y. Liu, and Jeffrey S. Vetter. 2022. IRIS-BLAS: Towards a Performance Portable and Heterogeneous BLAS Library. In 29th IEEE International Conference on High Performance Computing, Data, and Analytics, HiPC 2022, Bengaluru, India, December 18-21, 2022. IEEE, 256–261. https://doi.org/10.1109/HiPC56025.2022.00042
[29] Mohammad Alaul Haque Monil, Narasinga Rao Miniskar, Frank Y. Liu, Jeffrey S. Vetter, and Pedro Valero-Lara. 2022. LaRIS: Targeting Portability and Productivity for LAPACK Codes on Extreme Heterogeneous Systems by Using IRIS. In IEEE/ACM Redefining Scalability for Diversely Heterogeneous Architectures Workshop, RSDHA@SC 2022, Dallas, TX, USA, November 13-18, 2022. IEEE, 12–21. https://doi.org/10.1109/RSDHA56811.2022.00007
[30] Daichi Mukunoki and Takeshi Ogita. 2020. Performance and energy consumption of accurate and mixed-precision linear algebra kernels on GPUs. J. Comput. Appl. Math. 372 (2020), 112701. https://doi.org/10.1016/j.cam.2019.112701
[31] Thomas Norrie, Nishant Patil, Doe Hyun Yoon, George Kurian, Sheng Li, James Laudon, Cliff Young, Norman P. Jouppi, and David A. Patterson. 2021. The Design Process for Google's Training Chips: TPUv2 and TPUv3. IEEE Micro 41, 2 (2021), 56–63.
https://doi.org/10.1109/MM.2021.3058217
[32] Louis Pisha and Lukasz Ligowski. 2021. Accelerating non-power-of-2 size Fourier transforms with GPU Tensor Cores. In 35th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2021, Portland, OR, USA, May 17-21, 2021. IEEE, 507–516. https://doi.org/10.1109/IPDPS49936.2021.00059

[33] Sami Salamin, Georgios Zervakis, Florian Klemme, Hammam Kattan, Yogesh Singh Chauhan, Jörg Henkel, and Hussam Amrouch. 2022. Impact of NCFET Technology on Eliminating the Cooling Cost and Boosting the Efficiency of Google TPU. IEEE Trans. Computers 71, 4 (2022), 906–918. https://doi.org/10.1109/TC.2021.3065454

[34] Jorge Sastre and Jacinto Javier Ibáñez. 2023. On the backward and forward error of approximations of analytic functions and applications to the computation of matrix functions. J. Comput. Appl. Math. 419 (2023), 114706. https://doi.org/10.1016/j.cam.2022.114706

[35] Anumeena Sorna, Xiaohe Cheng, Eduardo F. D'Azevedo, Kwai Wong, and Stanimire Tomov. 2018. Optimizing the Fast Fourier Transform Using Mixed Precision on Tensor Core Hardware. In 25th IEEE International Conference on High Performance Computing Workshops, HiPCW 2018, Bengaluru, India, December 17-20, 2018. IEEE, 3–7. https://doi.org/10.1109/HiPCW.2018.8634417

[36] Qiao Sun, Wenjing Ma, Jiachang Sun, and Huiyuan Li. 2023. Evolving the HPL benchmark towards multi-GPGPU clusters. CCF Trans. High Perform. Comput. 5, 1 (2023), 84–96. https://doi.org/10.1007/s42514-022-00128-6

[37] Yufei Sun, Long Zheng, Qinggang Wang, Xiangyu Ye, Yu Huang, Pengcheng Yao, Xiaofei Liao, and Hai Jin. 2022. Accelerating Sparse Deep Neural Network Inference Using GPU Tensor Cores.
In IEEE High Performance Extreme Computing Conference, HPEC 2022, Waltham, MA, USA, September 19-23, 2022. IEEE, 1–7. https://doi.org/10.1109/HPEC55821.2022.9926300

[38] Pedro Valero-Lara, Sandra Catalán, Xavier Martorell, Tetsuzo Usui, and Jesús Labarta. 2020. sLASs: A fully automatic auto-tuned linear algebra library based on OpenMP extensions implemented in OmpSs (LASs Library). J. Parallel Distributed Comput. 138 (2020), 153–171. https://doi.org/10.1016/j.jpdc.2019.12.002

[39] Pedro Valero-Lara, Jungwon Kim, and Jeffrey S. Vetter. 2022. A Portable and Heterogeneous LU Factorization on IRIS. In Euro-Par 2022: Parallel Processing Workshops - Euro-Par 2022 International Workshops, Glasgow, UK, August 22-26, 2022, Revised Selected Papers (Lecture Notes in Computer Science, Vol. 13835), Jeremy Singer, Yehia Elkhatib, Dora Blanco Heras, Patrick Diehl, Nick Brown, and Aleksandar Ilic (Eds.). Springer, 17–31. https://doi.org/10.1007/978-3-031-31209-0_2

[40] Pedro Valero-Lara, Ivan Martínez-Pérez, Sergi Mateo, Raül Sirvent, Vicenç Beltran, Xavier Martorell, and Jesús Labarta. 2018. Variable Batched DGEMM. In 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing, PDP 2018, Cambridge, United Kingdom, March 21-23, 2018, Ivan Merelli, Pietro Liò, and Igor V. Kotenko (Eds.). IEEE Computer Society, 363–367. https://doi.org/10.1109/PDP2018.2018.00065

A ARTIFACT DESCRIPTION FOR REPRODUCIBILITY

The code used for this work and analysis is accessible via a public GitHub repository.2 We used one NVIDIA A100 GPU for the TF32 error and performance analysis, whereas the TF64 analysis was emulated in software on an Intel Xeon E5-2698 v4 CPU to mimic possible future computing frameworks. The code and the corresponding Makefile, which details the software stack used, can be found in the /src folder.
We also provide the scripts (found in the /script folder) that use the data collected in our experiments (found in the /data folder) to generate the plots (found in the /plot folder) presented in this work.

2 https://github.com/pedrovalerolara/TF32-TF64.git