I Wrote an MXFP4 Quantization Kernel and Ranked #1 on Tensara

Why I Did This

I’m building an FP4 fused attention kernel for consumer Blackwell GPUs (SM120). That means I spend my days thinking about how to squeeze 32-bit numbers into 4 bits without losing too much information. Tensara is a platform where you submit GPU kernels and compete on real hardware. They had an MXFP4 quantization problem with almost no submissions. I figured: I already know this format inside out on SM120, how hard can it be to write a standalone quantization kernel? ...
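For readers unfamiliar with the format the article refers to: a minimal, CPU-side sketch of MXFP4 quantization might look like the following. It assumes the OCP Microscaling convention (32-element blocks, one shared power-of-two scale stored as an E8M0 exponent, elements in FP4 E2M1 with magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6}); the scale-selection heuristic shown is one common choice, not necessarily the one the author's kernel uses.

```python
import math

# Representable magnitudes of FP4 E2M1 (sign is handled separately).
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def mxfp4_quantize_block(block):
    """Quantize one 32-element block to MXFP4: a shared power-of-two
    scale (stored as an E8M0 exponent) plus FP4 E2M1 elements.
    Returns (scale_exponent, dequantized_values)."""
    assert len(block) == 32
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [0.0] * 32
    # Heuristic scale choice (assumption): place the largest magnitude
    # at or just below 6.0, the largest FP4 E2M1 value:
    # exp = floor(log2(amax)) - floor(log2(6)).
    exp = math.floor(math.log2(amax)) - 2
    scale = 2.0 ** exp
    dequant = []
    for x in block:
        mag = min(abs(x) / scale, 6.0)  # clamp into the FP4 range
        # Round to the nearest representable FP4 magnitude.
        q = min(FP4_VALUES, key=lambda v: abs(v - mag))
        dequant.append(math.copysign(q * scale, x))
    return exp, dequant
```

The interesting part, and presumably what the kernel has to do fast on SM120, is the per-block reduction for `amax` followed by the element-wise round; on GPU those become a warp-level max and a vectorized conversion rather than this scalar loop.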

April 5, 2026 · 27 min · 5701 words · Florian Mattana