SASS King, Part 2: Reading the Compiler's Mind

1. Why a Second Article
When I started SASS King, I didn’t really know its purpose beyond understanding the compiler’s behaviour better. The deeper I went into CUDA, the more I was drawn to the point where software meets hardware, that tiny frontier few people actually care about. As a GPU programmer, when you write a kernel you still think at a level of abstraction that delegates a lot of power to the CUDA machinery. NVIDIA keeps adding abstractions up the stack so that developers can focus on high-level challenges, which are already numerous. If we agree that the principal use cases are LLM architectures, then the main challenges a developer faces are not really at the register level. Going further, we can even conclude that the main bottlenecks are rarely in compute and more often in memory: how we transfer data host-to-device (H2D), device-to-device (D2D), or device-to-host (D2H). A recent documented case makes this concrete. Anam, a company building real-time AI avatars, found that their Cara-3 inference pipeline was not GPU-bound at all. The GPU would run a burst of kernels, then sit idle, then run again. The bottleneck was Python runtime overhead stalling CPU dispatch and leaving the GPU waiting. Fixing the dispatch path, not the kernels, gave them a 2.5x latency improvement on the same hardware. Given this, trying to fully understand the deep machinery seems to sit on the wrong side of the risk/reward ratio. ...

May 8, 2026 · 16 min · 3397 words · Florian Mattana

SASS King, Part 1: Reading NVIDIA SASS from First Principles

SASS is the machine code NVIDIA GPUs actually execute. PTX is documented; SASS is not. This is the first entry in a long-form project to build a complete, architecture-by-architecture reference for NVIDIA SASS, starting from SM120, where I have hardware.

April 17, 2026 · 19 min · 4038 words · Florian Mattana

I Wrote an MXFP4 Quantization Kernel and Ranked #1 on Tensara

Why I Did This
I’m building an FP4 fused attention kernel for consumer Blackwell GPUs (SM120). That means I spend my days thinking about how to squeeze 32-bit numbers into 4 bits without losing too much information. Tensara is a platform where you submit GPU kernels and compete on real hardware. They had an MXFP4 quantization problem with almost no submissions. I figured: I already know this format inside out on SM120, how hard can it be to write a standalone quantization kernel? ...

April 5, 2026 · 27 min · 5701 words · Florian Mattana

Building an FP4 Fused Attention Kernel on Consumer Blackwell (SM120)

1. Why FP4 Fused Attention on Consumer Blackwell?
The attention mechanism in transformers scales quadratically with sequence length. On a consumer GPU with 12 GB of VRAM and 672 GB/s of memory bandwidth, that becomes a hard wall very quickly. The interesting thing about the RTX 5070 Ti (SM120, 46 SMs) is the raw throughput the Tensor Cores can deliver:

| Precision | Throughput |
| --- | --- |
| FP16 | 123.5 TFLOPS |
| INT8 | 246.9 TFLOPS |
| FP4 | ~474 TFLOPS |

That is roughly a 4x advantage going from FP16 to FP4, and since FP4 values are four times smaller, you also move four times less data through memory. On paper, that is a massive win for attention. If you can actually use the FP4 Tensor Cores. ...

March 17, 2026 · 39 min · 8211 words · Florian Mattana

From Silicon to Thread Identity: How CUDA Threads Know Who They Are

The Natural Question
When you start learning CUDA, you use threadIdx.x, blockIdx.x and blockDim.x like magic variables that always contain the right value. At some point, you naturally start wondering: how are these values computed? Is there a function somewhere in the CUDA runtime that produces them? Can you see the source code behind them? The answer is surprising: there is no code. These values are not the result of a software computation. They come directly from the hardware. ...

February 13, 2026 · 8 min · 1612 words · Florian Mattana

Exploring PTX: A Close Look at Tile Optimization in CUDA

Originally, I suggested on LinkedIn that I would write an article peeling back the abstractions of a PyTorch program and revealing, through debugging, some CUDA optimizations we could make. I ended up taking another road, but I kept the desire to dig deeper from a given level of abstraction. For context, I am currently writing a small program that compares kernels built with different optimization choices, including intentionally unoptimized ones, to highlight the differences between implementation options. ...

January 15, 2026 · 10 min · 1925 words · Florian Mattana