Building an FP4 Fused Attention Kernel on Consumer Blackwell (SM120)

1. Why FP4 Fused Attention on Consumer Blackwell?

The attention mechanism in transformers scales quadratically with sequence length. On a consumer GPU with 12 GB of VRAM and 672 GB/s of memory bandwidth, that becomes a hard wall very quickly. The interesting thing about the RTX 5070 Ti (SM120, 46 SMs) is the raw throughput its Tensor Cores can deliver:

Precision | Throughput
FP16      | 123.5 TFLOPS
INT8      | 246.9 TFLOPS
FP4       | ~474 TFLOPS

That is roughly a 4x advantage going from FP16 to FP4, and since FP4 values are four times smaller, you also move four times less data through memory. On paper, that is a massive win for attention. If you can actually use the FP4 Tensor Cores. ...
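To make those numbers concrete, here is a quick back-of-envelope sketch. It estimates how large a fully materialized attention score matrix gets as sequence length grows, and where the roofline crossover between memory-bound and compute-bound sits for the FP16 and FP4 peaks quoted above against 672 GB/s of bandwidth. The head count and sequence lengths are illustrative assumptions, not figures from this post.

```python
# Back-of-envelope only. GPU peaks and bandwidth are the figures quoted above;
# the head count and sequence lengths are illustrative assumptions.

BANDWIDTH_GBS = 672                                   # RTX 5070 Ti memory bandwidth, GB/s
PEAKS_TFLOPS = {"FP16": 123.5, "INT8": 246.9, "FP4": 474.0}
BYTES_PER_ELEM = {"FP16": 2.0, "INT8": 1.0, "FP4": 0.5}

HEADS = 32                                            # assumed number of attention heads

def score_matrix_gib(seq_len: int, dtype: str) -> float:
    """GiB needed to materialize the full S = Q @ K^T score matrix across all heads."""
    return seq_len * seq_len * HEADS * BYTES_PER_ELEM[dtype] / 2**30

def machine_balance(dtype: str) -> float:
    """Arithmetic intensity (FLOP/byte) above which the GPU is compute-bound."""
    return PEAKS_TFLOPS[dtype] * 1e12 / (BANDWIDTH_GBS * 1e9)

for n in (4096, 16384, 65536):
    print(f"N={n:6d}: score matrix {score_matrix_gib(n, 'FP16'):7.1f} GiB (FP16), "
          f"{score_matrix_gib(n, 'FP4'):7.1f} GiB (FP4)")

for dtype in ("FP16", "FP4"):
    print(f"{dtype}: compute-bound above ~{machine_balance(dtype):.0f} FLOP/byte")
```

At 32 heads and FP16, the score matrix already needs about 16 GiB at a 16K sequence length, past the 12 GB card. That quadratic term is exactly why the rest of the post builds a fused kernel, where the score matrix never touches global memory, and why FP4 then cuts the bytes that do have to move by another 4x.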

March 17, 2026 · 39 min · 8211 words · Florian Mattana