I Wrote an MXFP4 Quantization Kernel and Ranked #1 on Tensara

Why I Did This
I’m building an FP4 fused attention kernel for consumer Blackwell GPUs (SM120). That means I spend my days thinking about how to squeeze 32-bit numbers into 4 bits without losing too much information. Tensara is a platform where you submit GPU kernels and compete on real hardware. They had an MXFP4 quantization problem with almost no submissions. I figured: I already know this format inside out on SM120, how hard can it be to write a standalone quantization kernel? ...
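As a quick illustration of what that post is about, here is a minimal CPU-side sketch of MXFP4 quantization, assuming the standard MX layout: blocks of 32 values sharing an E8M0 power-of-two scale, with FP4 E2M1 elements. The function names and the tie-breaking detail are illustrative, not the submitted kernel:

```cpp
// mxfp4_reference.cpp -- CPU reference sketch of MXFP4 (block-scaled FP4) quantization.
// Assumptions: block size 32, E8M0 shared scale, E2M1 elements. Names are illustrative.
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// The eight non-negative values representable in FP4 E2M1 (1 sign, 2 exp, 1 mantissa bit).
static const float kE2M1[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

// Encode one float (already divided by the block scale) as a 4-bit E2M1 code.
static uint8_t encode_e2m1(float x) {
    uint8_t sign = (x < 0.0f) ? 0x8 : 0x0;
    float mag = std::fabs(x);
    if (mag > 6.0f) mag = 6.0f;                    // clamp to the largest E2M1 magnitude
    uint8_t best = 0;
    float best_err = 1e30f;
    for (uint8_t i = 0; i < 8; ++i) {              // nearest representable value,
        float err = std::fabs(mag - kE2M1[i]);     // ties broken toward the even code
        if (err < best_err || (err == best_err && (i & 1) == 0)) {
            best_err = err;
            best = i;
        }
    }
    return sign | best;
}

// Quantize one block of 32 floats: pick a power-of-two E8M0 scale from the block's
// absolute maximum, then encode each element as E2M1. Returns the biased scale exponent.
static uint8_t quantize_block(const float* v, uint8_t codes[32]) {
    float amax = 0.0f;
    for (int i = 0; i < 32; ++i) amax = std::fmax(amax, std::fabs(v[i]));
    // Shared exponent: floor(log2(amax)) minus E2M1's max exponent (2, since 6 = 1.5 * 2^2).
    int shared_exp = (amax > 0.0f) ? (int)std::floor(std::log2(amax)) - 2 : 0;
    if (shared_exp < -127) shared_exp = -127;      // keep within E8M0's representable range
    if (shared_exp > 127) shared_exp = 127;
    float scale = std::ldexp(1.0f, shared_exp);    // scale = 2^shared_exp
    for (int i = 0; i < 32; ++i) codes[i] = encode_e2m1(v[i] / scale);
    return (uint8_t)(shared_exp + 127);            // E8M0 stores the biased exponent
}

int main() {
    std::vector<float> block(32);
    for (int i = 0; i < 32; ++i) block[i] = 0.37f * (i - 16);  // toy data
    uint8_t codes[32];
    uint8_t scale_bits = quantize_block(block.data(), codes);
    float scale = std::ldexp(1.0f, (int)scale_bits - 127);
    printf("scale = 2^%d\n", (int)scale_bits - 127);
    for (int i = 0; i < 32; ++i) {
        float deq = ((codes[i] & 0x8) ? -1.0f : 1.0f) * kE2M1[codes[i] & 0x7] * scale;
        printf("%6.2f -> code 0x%X -> %6.2f\n", block[i], codes[i], deq);
    }
    return 0;
}
```

A real GPU kernel does the same per-block reduction and encoding, just vectorized across warps; this only captures the format.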

April 5, 2026 · 27 min · 5701 words · Florian Mattana

Building an FP4 Fused Attention Kernel on Consumer Blackwell (SM120)

1. Why FP4 Fused Attention on Consumer Blackwell?
The attention mechanism in transformers scales quadratically with sequence length. On a consumer GPU with 12 GB of VRAM and 672 GB/s of memory bandwidth, that becomes a hard wall very quickly. The interesting thing about the RTX 5070 Ti (SM120, 46 SMs) is the raw throughput the Tensor Cores can deliver:

Precision   Throughput
FP16        123.5 TFLOPS
INT8        246.9 TFLOPS
FP4         ~474 TFLOPS

That is roughly a 4x advantage going from FP16 to FP4, and since FP4 values are four times smaller, you also move four times less data through memory. On paper, that is a massive win for attention. If you can actually use the FP4 Tensor Cores. ...
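To make the data-movement side of that argument concrete, here is a small back-of-envelope sketch of the K/V bytes one decoded token has to read at FP16 versus FP4. The 672 GB/s figure comes from the excerpt above; the model shape (32 heads, head dim 128, 32K context) is an assumed example, not taken from the post:

```cpp
// kv_traffic_sketch.cpp -- back-of-envelope sketch of why FP4 helps attention on a
// bandwidth-limited GPU. The model shape is an illustrative assumption; only the
// 672 GB/s bandwidth figure is taken from the post excerpt.
#include <cstdio>

int main() {
    const double bandwidth_gbs = 672.0;   // RTX 5070 Ti memory bandwidth (from the post)
    const int heads = 32;                 // assumed model shape
    const int head_dim = 128;             // assumed
    const long long seq_len = 32768;      // assumed context length

    // Elements of K and V a single new query token must read during decode.
    const long long kv_elems = 2LL * heads * head_dim * seq_len;
    const double fp16_bytes = kv_elems * 2.0;   // 16 bits per element
    const double fp4_bytes  = kv_elems * 0.5;   // 4 bits per element (ignoring MX scales)

    printf("KV read per decoded token, FP16: %.1f MiB -> %.3f ms at %.0f GB/s\n",
           fp16_bytes / (1024.0 * 1024.0),
           fp16_bytes / (bandwidth_gbs * 1e9) * 1e3, bandwidth_gbs);
    printf("KV read per decoded token, FP4 : %.1f MiB -> %.3f ms at %.0f GB/s\n",
           fp4_bytes / (1024.0 * 1024.0),
           fp4_bytes / (bandwidth_gbs * 1e9) * 1e3, bandwidth_gbs);
    return 0;
}
```

Under those assumed shapes the FP16 KV read is about 512 MiB per token (roughly 0.8 ms at 672 GB/s) versus 128 MiB at FP4, which is the "four times less data" claim in concrete numbers.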

March 17, 2026 · 39 min · 8211 words · Florian Mattana