DeepGEMM: Clean and Efficient FP8 GEMM Library
Introduction
DeepGEMM is a clean and efficient FP8 General Matrix Multiplication (GEMM) library with fine-grained scaling, released by DeepSeek as part of their “Open Source Week” in February 2025. It supports both normal dense GEMMs and Mixture-of-Experts (MoE) grouped GEMMs, providing high-performance matrix operations critical for modern AI models.
Background and Motivation
Matrix multiplication is at the heart of deep learning computations, particularly in transformer-based models. As models continue to scale, there’s an increasing need for more efficient matrix operations that can leverage hardware capabilities while maintaining numerical stability. FP8 (8-bit floating point) has emerged as a promising data format that offers a balance between precision and computational efficiency.
However, implementing efficient FP8 GEMM operations is challenging, especially when considering the complexities of modern GPU architectures and the need for fine-grained scaling to maintain numerical stability. Existing libraries often involve complex template metaprogramming that makes them difficult to understand and modify.
DeepGEMM was developed to address these challenges by providing a clean, efficient, and accessible implementation of FP8 GEMM operations that achieves high performance while maintaining readability and extensibility.
Key Features and Capabilities
DeepGEMM offers several key features that make it valuable for AI model development:
- Lightweight Design: Core kernel function of only ~300 lines of code, making it accessible as a learning resource while still delivering high performance.
- Just-In-Time (JIT) Compilation: No compilation needed during installation, as kernels are compiled at runtime using a lightweight JIT module.
- FP8 Support with Fine-Grained Scaling: Implements efficient FP8 operations with fine-grained scaling to maintain numerical stability.
- Multiple GEMM Formats:
  - Normal dense GEMM for standard matrix operations
  - Grouped contiguous GEMM for MoE models with contiguous layout
  - Grouped masked GEMM for MoE models with masked layout
- High Performance: Reaches over 1350 FP8 TFLOPS on Hopper GPUs, matching or exceeding expert-tuned libraries across a wide range of matrix shapes.
- Auto-Tuning: Automatically selects optimal kernel configurations for different matrix shapes and hardware setups.
Technical Implementation
DeepGEMM is implemented as a combination of Python and CUDA components with a focus on clean design and runtime optimization. The implementation consists of several key components:
- JIT Compilation System:
  - Compiles CUDA kernels at runtime using templates
  - Caches compiled kernels for reuse
  - Supports FFMA interleaving optimization for better performance
- GEMM Kernels:
  - Normal dense GEMM: gemm_fp8_fp8_bf16_nt
  - Grouped contiguous GEMM: m_grouped_gemm_fp8_fp8_bf16_nt_contiguous
  - Grouped masked GEMM: m_grouped_gemm_fp8_fp8_bf16_nt_masked
- Auto-Tuning System:
  - Automatically selects optimal kernel configurations
  - Tunes parameters such as block sizes, number of pipeline stages, and TMA multicast
- Fine-Grained Scaling:
  - Supports 1×128 LHS scaling and 128×128 RHS scaling (see the quantization sketch below)
  - Implements efficient TMA-aligned tensor handling
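To illustrate what the fine-grained scaling granularity means, the following sketch quantizes inputs to FP8 with one scale per 1×128 group along the inner dimension for the LHS and one scale per 128×128 block for the RHS. The helper names and exact tensor layouts here are illustrative assumptions; DeepGEMM ships its own conversion utilities and has additional requirements such as TMA-aligned scale tensors.

```python
import torch

FP8_MAX = 448.0  # largest representable magnitude of torch.float8_e4m3fn

def quantize_lhs_1x128(x: torch.Tensor):
    """Hypothetical helper: quantize a (m, k) activation tensor with one
    scale per 1x128 group along k (k assumed divisible by 128)."""
    m, k = x.shape
    groups = x.view(m, k // 128, 128)
    scales = groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-4) / FP8_MAX
    x_fp8 = (groups / scales).to(torch.float8_e4m3fn).view(m, k)
    return x_fp8, scales.squeeze(-1).float()  # scales: (m, k // 128)

def quantize_rhs_128x128(w: torch.Tensor):
    """Hypothetical helper: quantize a (n, k) weight tensor with one
    scale per 128x128 block (n and k assumed divisible by 128)."""
    n, k = w.shape
    blocks = w.view(n // 128, 128, k // 128, 128)
    scales = blocks.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-4) / FP8_MAX
    w_fp8 = (blocks / scales).to(torch.float8_e4m3fn).view(n, k)
    return w_fp8, scales.view(n // 128, k // 128).float()  # scales: (n//128, k//128)
```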
The library addresses the imprecise FP8 tensor core accumulation issue by implementing CUDA-core two-level accumulation (promotion), ensuring numerical stability without sacrificing performance.
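To make the promotion idea concrete, here is a conceptual model in plain PyTorch, not the actual CUDA kernel: the K dimension is processed in blocks, each block's partial product stands in for what the tensor cores accumulate in limited precision, and the partials are then summed into a full FP32 accumulator, which is the role the CUDA cores play. The block size and function name are illustrative assumptions.

```python
import torch

def promoted_gemm_reference(a: torch.Tensor, b: torch.Tensor, k_block: int = 128) -> torch.Tensor:
    """Conceptual model of two-level accumulation (not the real kernel).

    Each K-block partial product plays the role of the tensor cores'
    limited-precision accumulation; adding the partials into `acc` at FP32
    plays the role of the CUDA-core promotion step."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and k % k_block == 0
    acc = torch.zeros(m, n, dtype=torch.float32)  # full-precision accumulator
    for k0 in range(0, k, k_block):
        # Block-local partial product (stand-in for tensor-core accumulation)
        partial = a[:, k0:k0 + k_block].float() @ b[k0:k0 + k_block, :].float()
        acc += partial  # promotion: accumulate each block's result in FP32
    return acc
```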
Performance and Benchmarks
DeepGEMM demonstrates impressive performance across a wide range of matrix shapes and configurations:
Normal GEMMs for dense models
- Achieves up to 1358 TFLOPS for large matrices (4096x7168x16384)
- Provides a 1.1x-2.7x speedup over an optimized CUTLASS implementation
- Excellent performance for both memory-bound and compute-bound configurations
Grouped GEMMs for MoE models
- Supports both contiguous and masked layouts
- Achieves up to 1297 TFLOPS for grouped contiguous layout
- Provides a 1.1x-1.2x speedup compared to optimized implementations
These performance characteristics make DeepGEMM particularly well-suited for large-scale models like DeepSeek-V3/R1, where efficient matrix operations are crucial for both training and inference.
Integration and Usage
DeepGEMM is designed to be easily integrated into existing deep learning frameworks. It provides a PyTorch-compatible interface that can be used to replace standard matrix multiplication operations with optimized FP8 implementations. The typical usage pattern involves:
- Preparing input tensors in the appropriate format (FP8 with scaling factors)
- Calling the appropriate GEMM function based on the operation type (normal, grouped contiguous, or grouped masked)
- Processing the output tensor (BF16) as needed
The library also provides utility functions for tensor format conversion and configuration management.
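As a rough sketch of that usage pattern, the snippet below calls the normal dense kernel named earlier. The scale-tensor shapes, the way inputs are cast to FP8, and the exact argument handling are assumptions for illustration; the DeepGEMM repository documents the precise tensor formats it expects, including TMA alignment of the LHS scales.

```python
import torch
import deep_gemm

m, k, n = 4096, 7168, 4096  # example problem size

# FP8 inputs paired with their fine-grained scaling factors:
# one scale per 1x128 group for the LHS, one per 128x128 block for the RHS.
x_fp8 = torch.randn(m, k, device='cuda').to(torch.float8_e4m3fn)
x_scales = torch.ones(m, k // 128, device='cuda', dtype=torch.float32)
y_fp8 = torch.randn(n, k, device='cuda').to(torch.float8_e4m3fn)
y_scales = torch.ones(n // 128, k // 128, device='cuda', dtype=torch.float32)

# The output is produced in BF16.
out = torch.empty(m, n, device='cuda', dtype=torch.bfloat16)

# Normal dense GEMM in the "NT" layout; the grouped contiguous and grouped
# masked variants follow the same pattern with extra arguments describing
# the expert grouping.
deep_gemm.gemm_fp8_fp8_bf16_nt((x_fp8, x_scales), (y_fp8, y_scales), out)
```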
Conclusion
DeepGEMM represents a significant advancement in optimizing matrix operations for modern AI models. By providing a clean, efficient implementation of FP8 GEMM operations, it addresses one of the key computational bottlenecks in large-scale models. Its support for both normal dense GEMMs and MoE grouped GEMMs makes it versatile for a wide range of model architectures, while its optimization for modern hardware ensures it can deliver maximum performance on current and future systems.
What sets DeepGEMM apart is its balance of performance and accessibility. While achieving performance that matches or exceeds expert-tuned libraries, it maintains a clean, readable codebase that serves as both a practical tool and an educational resource for understanding GPU optimization techniques. As AI models continue to scale and efficiency becomes increasingly critical, tools like DeepGEMM will play an essential role in pushing the boundaries of what’s possible.