Building an FP4 Fused Attention Kernel on Consumer Blackwell (SM120) (WIP)

I have been trying to write a fused FP4 attention kernel that runs on consumer Blackwell GPUs, specifically the RTX 5070 Ti. This post documents the full journey: every wrong turn, every hardware surprise, and every trade-off I had to make along the way. 1. Why FP4 Fused Attention on Consumer Blackwell? The attention mechanism in transformers scales quadratically with sequence length. On a consumer GPU with 12 GB of VRAM and 672 GB/s of memory bandwidth, that becomes a hard wall very quickly. The interesting thing about the RTX 5070 Ti (SM120, 46 SMs) is the raw throughput the Tensor Cores can deliver: ...

March 17, 2026 · 11 min · 2249 words · Florian Mattana

From Silicon to Thread Identity: How CUDA Threads Know Who They Are

The Natural Question When you start learning CUDA, you use threadIdx.x, blockIdx.x and blockDim.x like magic variables that always contain the right value. At some point, you naturally start wondering: how are these values computed? Is there a function somewhere in the CUDA runtime that produces them? Can you see the source code behind them? The answer is surprising: there is no code. These values are not the result of a software computation. They come directly from the hardware. ...

February 13, 2026 · 8 min · 1612 words · Florian Mattana

Exploring PTX: A Close Look at Tile Optimization in CUDA

Originally, I suggested on LinkedIn that I would write an article to unabstract a PyTorch program and reveal some potential CUDA optimizations we could make by debugging it. I ended up taking a different road, but I kept the desire to dig deeper from a given level of abstraction. For context, I am currently writing a small program that compares kernels with different optimization choices, including intentionally unoptimized ones, to highlight the differences between various implementation options. ...

January 15, 2026 · 10 min · 1925 words · Florian Mattana