We implemented a matrix multiplication engine in CubeCL that rivals cuBLAS and CUTLASS in performance while supporting a wider range of GPUs. The engine uses double buffering, tensor cores, and vectorization, and compiles seamlessly to CUDA, ROCm, WebGPU, Metal, and Vulkan backends without relying on proprietary or third-party binaries. Matrix multiplication is central to modern AI workloads, especially transformers, and optimizing it ourselves was essential to enable kernel fusion and achieve state-of-the-art performance across platforms in a deep learning framework.
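To make the tiling idea behind such a kernel concrete, here is a minimal CPU sketch in plain Rust (not CubeCL's actual API): the K dimension is processed in fixed-size tiles so that each tile fits in fast memory. On a GPU, each tile of A and B would be staged into shared memory, and double buffering would overlap the load of the next tile with computation on the current one. The tile size and function names below are illustrative assumptions.

```rust
// Illustrative tile size; a real kernel tunes this per GPU.
const TILE: usize = 4;

// Blocked matmul over row-major slices: C += A * B,
// with A of shape (m, k), B of shape (k, n), C of shape (m, n).
fn matmul_tiled(a: &[f32], b: &[f32], c: &mut [f32], m: usize, n: usize, k: usize) {
    for i0 in (0..m).step_by(TILE) {
        for j0 in (0..n).step_by(TILE) {
            for k0 in (0..k).step_by(TILE) {
                // On a GPU this inner region corresponds to one
                // shared-memory tile; double buffering would prefetch
                // the tile at k0 + TILE while this one is consumed.
                for i in i0..(i0 + TILE).min(m) {
                    for j in j0..(j0 + TILE).min(n) {
                        let mut acc = 0.0f32;
                        for kk in k0..(k0 + TILE).min(k) {
                            acc += a[i * k + kk] * b[kk * n + j];
                        }
                        c[i * n + j] += acc;
                    }
                }
            }
        }
    }
}

fn main() {
    let (m, n, k) = (5, 6, 7);
    let a: Vec<f32> = (0..m * k).map(|x| x as f32).collect();
    let b: Vec<f32> = (0..k * n).map(|x| (x % 3) as f32).collect();
    let mut c = vec![0.0f32; m * n];
    matmul_tiled(&a, &b, &mut c, m, n, k);

    // Verify against a naive reference implementation.
    for i in 0..m {
        for j in 0..n {
            let mut r = 0.0f32;
            for kk in 0..k {
                r += a[i * k + kk] * b[kk * n + j];
            }
            assert!((c[i * n + j] - r).abs() < 1e-3);
        }
    }
    println!("tiled matmul matches naive reference");
}
```

The same loop structure maps onto a GPU by assigning the `i0`/`j0` tiles to thread blocks; tensor cores and vectorized loads then accelerate the innermost accumulation.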