In this blog post, we’ll walk through an implementation of the SGEMM (Single-precision GEneral Matrix Multiply) operation, defined as C := alpha*A*B + beta*C. We will review three different kernels, each optimized for a different range of matrix sizes. Our final implementation is tuned for the Ampere architecture and outperforms cuBLAS on a wide range of matrix sizes.
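Before diving into the GPU kernels, it helps to pin down exactly what SGEMM computes. Here is a minimal CPU reference sketch of the operation (my own illustration, not one of the kernels discussed below; it assumes row-major storage with A of shape M×K, B of shape K×N, and C of shape M×N):

```c
/* Reference CPU implementation of SGEMM: C := alpha*A*B + beta*C.
 * All matrices are row-major: A is M x K, B is K x N, C is M x N.
 * This is a correctness reference only, not an optimized kernel. */
void sgemm_ref(int M, int N, int K, float alpha,
               const float *A, const float *B,
               float beta, float *C) {
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            float acc = 0.0f;
            for (int k = 0; k < K; k++)
                acc += A[i * K + k] * B[k * N + j];  /* dot product of row i of A and column j of B */
            C[i * N + j] = alpha * acc + beta * C[i * N + j];
        }
    }
}
```

Every kernel in this post is, functionally, just a faster way of producing the same result as these three loops.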