I benchmarked three sparse SpMSpM algorithms — inner-product, outer-product, and Gustavson's algorithm — against vendor-optimized libraries, namely cuSPARSE on NVIDIA T4 and rocSPARSE on AMD MI300x, using LLM-realistic top-k attention sparsity. The results surprised me: all three Triton implementations outperformed the vendor libraries in every test case, with the outer-product approach achieving up to a 198× speedup over rocSPARSE.

This post explains what I benchmarked, what I found, and — most importantly — why outer-product wins.

Background

Modern LLMs exhibit natural sparsity in their attention score matrices. After softmax, most attention weights are near-zero — only a small fraction of token pairs carry significant attention mass. This is the "heavy hitter" phenomenon observed by H2O, SnapKV, and StreamingLLM. Exploiting this sparsity requires efficient sparse matrix multiplication (SpMSpM) kernels.

The Zhang et al. analyse inner-product, outer-product, and Gustavson's algorithms for SpMSpM on hardware accelerators. My work extends this to GPU hardware using Triton and adds a direct comparison against vendor-optimised sparse libraries.

Experimental Setup

Hardware

HardwareArchitectureMemoryPeak BW
NVIDIA T4Turing, Tensor Cores16GB GDDR6300 GB/s
AMD MI300xCDNA3, Matrix Cores192GB HBM35,300 GB/s

Workloads

6 attention score matrix shapes from LLaMA 1-7B — Q×KT and scores×V at sequence lengths 512, 1024, and 2048. Sparsity levels: 70%, 90%, 95% (top-k per row, LLM-realistic).

Algorithms

AlgorithmStrategyFormat
VendorcuSPARSE / rocSPARSECSR
Inner-productC[i,j] = dot(A[i,:], B[:,j])A: CSR, B: dense
Outer-productC += Σ A[:,k] × B[k,:]dense tiling + tl.dot
GustavsonC[i,:] += A[i,k] * B[k,:]A: CSR, B: dense

Results

Algorithm comparison

Outer-product wins 36/36 cases on both T4 and MI300x. All three Triton algorithms beat vendor on 54/54 cases. Maximum speedup: 198x over rocSPARSE on MI300x at 70% sparsity.

Winning algorithm heatmap across hardware and sparsity levels

Winning algorithm across all shapes, sparsity levels, and hardware. Outer-product wins every single case.

Speedup vs vendor baseline at 90% sparsity

Speedup vs vendor baseline at 90% sparsity. Outer-product achieves 20-40x speedup across all shapes.

Sparsity sensitivity

Higher sparsity amplifies the advantage. At 70% sparsity, outer-product achieves a mean 73x speedup over vendor. At 95%, this narrows to 15x — still very significant.

SparsityT4 mean speedupMI300x mean speedup
70%73.5x70.2x
90%24.2x32.1x
95%15.3x24.9x

GPU reverses the CPU algorithm ranking

The GAMMA paper found Gustavson outperforms inner-product on tensor accelerators. On GPU, the two are essentially identical — within 3% of each other on both T4 and MI300x. Outer-product dominates both on GPU.

Why Outer-Product Wins

The performance gap is explained by memory bandwidth utilization. I measured achieved memory bandwidth for each algorithm:

Memory bandwidth utilization per algorithm

Mean achieved memory bandwidth per algorithm. Outer-product achieves 3.3x higher bandwidth than inner/Gustavson on T4, and 6.7x higher on MI300x.

AlgorithmT4 (300 GB/s peak)MI300x (5300 GB/s peak)
outer_product52.5 GB/s (17.5%)272 GB/s (5.1%)
inner_product16.0 GB/s (5.3%)40.6 GB/s (0.8%)
gustavson16.6 GB/s (5.5%)39.6 GB/s (0.7%)
vendor1.4 GB/s (0.5%)3.6 GB/s (0.1%)

The reason is algorithmic. Outer-product processes the computation as:

C += sum_k outer(A[:,k], B[k,:])  # rank-1 updates via tl.dot()

This maps directly to tl.dot() in Triton, which compiles to Tensor Core (NVIDIA) or Matrix Core (AMD) operations. The memory access pattern is regular and coalesced — tiles of A and B are loaded sequentially, accumulated into the output.

Inner-product and Gustavson both require gather operations — loading specific rows of B indexed by sparse column indices of A. This produces scattered memory loads that cannot leverage Tensor Cores efficiently, resulting in 3x lower bandwidth utilization.

Vendor libraries (cuSPARSE/rocSPARSE) use CSR format and process only non-zeros — but cannot utilize Tensor Cores for sparse operations, achieving less than 1% of peak bandwidth.

Key insight: For top-k attention sparsity, outer-product's regular tiling pattern enables Tensor Core utilization — achieving 3-7x higher memory bandwidth than gather-based algorithms and 35-140x higher than vendor CSR kernels.

MI300x: compute-bound, not memory-bound

On MI300x, outer-product achieves 272 GB/s against a 5,300 GB/s peak — only 5.1% utilization. This doesn't mean the kernel is inefficient. It means the kernel is compute-bound (Tensor Core limited), not memory-bound. MI300x's HBM3 bandwidth is so large that it feeds the Tensor Cores faster than they can consume data — the bottleneck shifts from memory to compute. This is why MI300x achieves even larger speedups than T4 for outer-product.

Dense Dataflow: a Separate Finding

I also benchmarked Weight Stationary (WS), Input Stationary (IS), and Output Stationary (OS) dataflow strategies for dense matmul across 13 LLaMA 1-7B shapes.

Winning dataflow strategy heatmap

Winning dataflow strategy per hardware and shape. IS wins on MI300x (13/13), WS wins on T4 (11/13).

Dataflow choice is hardware-dependent. IS wins on MI300x for all shapes — HBM3 bandwidth absorbs dataflow differences. WS wins on T4 for 11/13 shapes — bandwidth constraints make weight reuse critical. OS is consistently worst (3.3x penalty on T4, 1.26x on MI300x).

Code

All kernels, benchmark scripts, results, and analysis are available at: github.com/midiareshadi/dataflow-bench

git clone https://github.com/midiareshadi/dataflow-bench.git
cd dataflow-bench
pip install torch triton pandas numpy matplotlib

# Dense dataflow benchmark
python3 dense/benchmark_t4.py

# Sparse SpMSpM benchmark
python3 sparse/benchmark_t4.py

# Bandwidth utilization analysis
python3 sparse/analysis/roofline.py

References