[{"content":"Why I Did This I\u0026rsquo;m building an FP4 fused attention kernel for consumer Blackwell GPUs (SM120). That means I spend my days thinking about how to squeeze 32-bit numbers into 4 bits without losing too much information.\nTensara is a platform where you submit GPU kernels and compete on real hardware. They had an MXFP4 quantization problem with almost no submissions. I figured: I already know this format inside out on SM120, how hard can it be to write a standalone quantization kernel?\nTurns out the kernel itself was straightforward. What surprised me were the subtle details that took hours to get right. This post walks through everything, step by step, assuming you\u0026rsquo;ve never heard of FP4 or quantization before.\n1. The Problem: Numbers Are Too Big A modern AI model like Llama has billions of parameters. Each parameter is a 32-bit floating point number, that\u0026rsquo;s 4 bytes. A 7-billion parameter model takes 28 GB just to store the weights. That\u0026rsquo;s more than most GPUs can hold.\nWe need to make these numbers smaller. Not fewer numbers, smaller numbers. Instead of 32 bits per number, what if we used 4 bits? That\u0026rsquo;s 8 times less memory. A 28 GB model becomes 3.5 GB.\nThe catch: with 4 bits, you can only represent 16 different values. With 32 bits, you can represent about 4 billion different values. So we\u0026rsquo;re going from 4 billion choices down to 16. We\u0026rsquo;re going to lose information. The question is: how do we lose as little as possible?\nThis is what quantization is. The full process has three steps:\nScale: for each group of 32 values, compute a scale factor that brings those values into a range that 4 bits can represent. Encode: divide each value by its group\u0026rsquo;s scale, then round to the nearest of the 16 representable FP4 values. This is the encoder. Pack: two 4-bit values fit in a single 8-bit byte. We pack them together to save space. Each step has its own pitfalls. 
This post covers all three, plus the GPU kernel that runs them in parallel on over a million groups simultaneously.\n2. FP4 E2M1: The Format The format we\u0026rsquo;re using is called FP4 E2M1. The name tells you the bit layout: 1 sign bit, 2 exponent bits, 1 mantissa bit. Total: 4 bits.\nWith these 4 bits, you can represent exactly these magnitudes:\n0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0 And their negative versions. That\u0026rsquo;s it. Your number 0.8734 has to become one of these. Your number -147.3 also has to become one of these.\nThe spacing isn\u0026rsquo;t uniform. The gap between 0 and 0.5 is 0.5, but the gap between 4.0 and 6.0 is 2.0. This is by design. Floating point formats have more precision near zero and less precision for large values.\nOther FP4 formats E2M1 is not the only way to split 4 bits. There are other formats, each with different trade-offs:\nFP4 E3M0 uses 3 exponent bits and 0 mantissa bits. No mantissa means you can only represent exact powers of 2: 0.25, 0.5, 1, 2, 4, 8, 16. Much wider range (up to 16 instead of 6) but no precision between powers of 2. The value 1.5 simply doesn\u0026rsquo;t exist in this format. Rarely used in practice because the gaps are too large.\nNF4 (NormalFloat4) takes a completely different approach. Instead of using the sign/exponent/mantissa structure, it picks 16 values that match the statistical distribution of neural network weights, which tend to follow a bell curve. The values are irregularly spaced, clustered where weights are most likely to appear. Used by QLoRA for model fine-tuning. No hardware acceleration on current GPUs, it\u0026rsquo;s purely a software format.\nE2M1 has become the standard for hardware-accelerated FP4 because it offers the best balance between range and precision for typical AI workloads. Both the industry-wide open standard (MXFP4, which we\u0026rsquo;ll cover in section 4) and NVIDIA\u0026rsquo;s proprietary format (NVFP4) use E2M1 for the data itself. 
They differ in how they handle the scaling, which we\u0026rsquo;ll get to next.\n3. The Range Problem Look at the E2M1 values again. The biggest one is 6.0. What do you do with a number like 150?\nYou can\u0026rsquo;t represent it. 6.0 is the max. Everything above 6.0 gets clamped to 6.0.\nThis is terrible. If your data ranges from -200 to +200, everything gets crushed into [-6, 6] and you lose almost all information.\n4. MXFP4 and Block Scaling The first step of quantization is to divide your numbers by a scale factor before encoding them. If your data goes up to 150, a scale of 32 brings it down to 150 / 32 = 4.69, which fits in the FP4 range. You store the scale alongside the data so you can reverse the division later and recover an approximation of the original values.\nBut one scale for the entire matrix is too coarse. Imagine one region of your matrix has values around 0.001 and another region has values around 100. A scale that works for 100 will crush the 0.001 values to zero.\nThe solution: split the matrix into groups of 32 consecutive elements along each row. Each group gets its own scale, calculated from the largest absolute value in that group. The groups with small values get a small scale (preserving their precision), the groups with large values get a large scale (preventing overflow).\nFor a matrix of M rows and K columns, you get M x (K/32) scale values. For example, a matrix of 1024 rows and 4096 columns produces 1024 x (4096/32) = 1024 x 128 = 131,072 scale values. That\u0026rsquo;s 131,072 bytes (128 KB) of overhead, compared to 16 MB for the original matrix in FP32. Less than 1% overhead.\nMXFP4: the standard This block scaling scheme, combined with the E2M1 data format and specific rules for rounding and scale computation, forms a complete specification called MXFP4 (Microscaling FP4).\nMXFP4 is published by the OCP (Open Compute Project). 
The OCP is a consortium of tech companies, including Meta, Microsoft, AMD, Intel, and NVIDIA, that defines open hardware and software standards. \u0026ldquo;Open\u0026rdquo; means anyone can implement them without licensing fees. The goal is interoperability: a model quantized in MXFP4 can run on any hardware that supports the standard, not just one vendor\u0026rsquo;s chips.\nWhy not just use NVIDIA\u0026rsquo;s format on NVIDIA hardware? NVIDIA does have their own proprietary format called NVFP4. The data is still E2M1, but the scaling works differently: NVFP4 uses two levels of scale factors (a coarse per-tensor scale and a fine per-block scale with blocks of 16 elements) instead of MXFP4\u0026rsquo;s single scale per block of 32. This gives NVFP4 better dynamic range, but it\u0026rsquo;s tied to NVIDIA hardware.\nNVIDIA\u0026rsquo;s Blackwell GPUs (the B200 that Tensara uses for benchmarking) support both MXFP4 and NVFP4. The Tensara problem asks for MXFP4 specifically, because it uses TorchAO (PyTorch\u0026rsquo;s quantization library) as the reference implementation, and TorchAO implements the OCP standard.\nThis matters for verification: Tensara takes my output, dequantizes it using TorchAO\u0026rsquo;s code, does the same with TorchAO\u0026rsquo;s own quantization, and compares the two results. If my scale computation or rounding doesn\u0026rsquo;t match TorchAO\u0026rsquo;s behavior exactly, the test fails. So understanding how TorchAO implements the spec is just as important as understanding the spec itself.\n5. E8M0: Why the Scale Must Be a Power of 2 The scale is stored in a format called E8M0: 8 exponent bits, 0 mantissa bits. No mantissa means the scale is always an exact power of 2: 1, 2, 4, 8, or going the other way, 0.5, 0.25, 0.125, and so on.\nWhy force powers of 2? Because GPUs can multiply by a power of 2 by just shifting the exponent. It\u0026rsquo;s essentially free in hardware. 
A scale of 3.7 would require a real floating-point multiplication. A scale of 4.0 = 2^2 is a bit shift. When your Tensor Core is doing billions of these operations per second, \u0026ldquo;free\u0026rdquo; matters.\nThe bias: a thermometer trick The exponent stored in E8M0 can be negative. A scale of 0.125 is 2^(-3), so the exponent is -3. But we\u0026rsquo;re storing this in a single byte (values 0 to 255). How do you fit a negative number in there?\nThink of a thermometer. Temperature can go below zero (-30°C, -10°C), but the markings on the physical tube start at the bottom and go up. A thermometer that reads from -127°C to +127°C could just shift everything up by 127: the bottom of the tube is labeled 0 (meaning -127°C), the middle is labeled 127 (meaning 0°C), and the top is labeled 254 (meaning +127°C). To read the real temperature, subtract 127 from the tube reading.\nE8M0 does exactly this. The real exponent can range from -127 to +127 (the stored value 255 is reserved for NaN), but instead of storing negative numbers, we add 127 to everything. This offset is called the bias.\nstored_value = real_exponent + 127\nreal_exponent = stored_value - 127\nscale = 2^(stored_value - 127)\nSome examples:\nStored value Calculation Real exponent Scale\n120 120 - 127 -7 2^(-7) = 0.0078\n124 124 - 127 -3 2^(-3) = 0.125\n127 127 - 127 0 2^0 = 1.0\n130 130 - 127 3 2^3 = 8.0\n137 137 - 127 10 2^10 = 1024.0\nThe regular 32-bit floats that every computer uses also store their exponent with a bias of 127. It\u0026rsquo;s the same trick, and we\u0026rsquo;ll use that fact later to extract the exponent cheaply from a float\u0026rsquo;s binary representation.\n6. Computing the Scale: The Part That Took Hours This is where the engineering got interesting. Here\u0026rsquo;s the logic:\nFind the biggest absolute value in your block of 32 numbers. Call it amax. You want amax / scale to fit inside the FP4 range, which maxes out at 6.0. So you want scale ≈ amax / 6.0. But scale must be a power of 2, so you pick the nearest power of 2. 
Let\u0026rsquo;s work through a concrete example. Say your block of 32 values has a max absolute value of amax = 25.0. You want scale ≈ 25.0 / 6.0 ≈ 4.17. The nearest powers of 2 are 4.0 (below) and 8.0 (above). Which do you pick?\nFloor vs ceil: a real trade-off If you pick 4.0 (rounding down, called \u0026ldquo;floor\u0026rdquo;): 25.0 / 4.0 = 6.25. That exceeds the FP4 max of 6.0, so the value 25.0 gets clamped to 6.0. When you dequantize later, you recover 6.0 x 4.0 = 24.0 instead of 25.0. Error on this value: 1.0.\nNow take a smaller value in the same block, say 1.5. With scale 4.0: 1.5 / 4.0 = 0.375. That rounds to FP4 value 0.5 (the nearest representable value). Dequantized: 0.5 x 4.0 = 2.0. Error: 0.5.\nIf you pick 8.0 (rounding up, called \u0026ldquo;ceil\u0026rdquo;): 25.0 / 8.0 = 3.125, which rounds to FP4 value 3.0. Dequantized: 3.0 x 8.0 = 24.0. Error: 1.0. Same as floor for the max value.\nBut now the value 1.5 becomes 1.5 / 8.0 = 0.1875. That rounds to FP4 value 0.0 (since 0.1875 is below the midpoint 0.25 between 0.0 and 0.5). Dequantized: 0.0 x 8.0 = 0.0. Error: 1.5. The value 1.5 was completely erased to zero.\nWith the floor scale (4.0), that same value 1.5 became 2.0. Not perfect, but the information is preserved. With the ceil scale (8.0), it became 0.0. Gone.\nThe pattern: with a larger scale, small values get pushed toward zero more aggressively. The OCP standard uses floor. Sacrifice accuracy on the one or two extreme values in the block (they get clamped to 6.0), but give every other value in the block the best precision possible.\nGetting floor(log2) for free from the float bits In code, \u0026ldquo;floor of log2\u0026rdquo; is surprisingly easy to compute. You don\u0026rsquo;t need a logarithm function at all.\nEvery 32-bit float is stored in memory as 32 bits: 1 sign bit, 8 exponent bits, 23 mantissa bits. The computer represents every float as mantissa x 2^exponent, where the mantissa is always between 1.0 and 2.0. 
The exponent tells you the \u0026ldquo;order of magnitude\u0026rdquo; in powers of 2.\nThis means log2(value) = exponent + log2(mantissa). Since the mantissa is between 1.0 and 2.0, log2(mantissa) is between 0 and 1. So floor(log2(value)) is simply the exponent. It\u0026rsquo;s already sitting there in the float\u0026rsquo;s bits.\nThe exponent is stored with a bias of 127 (the same thermometer trick from section 5), so to extract it: read the exponent bits, subtract 127.\nTwo examples:\namax = 0.945\n0.945 in memory = 1.890 x 2^(-1)\nThe mantissa is 1.890 (between 1.0 and 2.0, good)\nThe exponent is -1, stored as -1 + 127 = 126\nWe extract 126, subtract bias: 126 - 127 = -1\nfloor(log2(0.945)) = -1\nCheck: 2^(-1) = 0.5 ≤ 0.945 \u0026lt; 1.0 = 2^0. Correct.\namax = 25.0\n25.0 in memory = 1.5625 x 2^4\nThe mantissa is 1.5625 (between 1.0 and 2.0, good)\nThe exponent is 4, stored as 4 + 127 = 131\nWe extract 131, subtract bias: 131 - 127 = 4\nfloor(log2(25.0)) = 4\nCheck: 2^4 = 16 ≤ 25.0 \u0026lt; 32 = 2^5. Correct.\nThe full scale calculation\nunsigned int bits = __float_as_uint(max_abs);\nint max_exp = (int)((bits \u0026gt;\u0026gt; 23) \u0026amp; 0xFF) - 127;\nint scale_exp = max_exp - 2;\nint biased = scale_exp + 127;\nLine by line:\nLine 1: __float_as_uint(max_abs) reads the raw 32 bits of the float as an unsigned integer. We\u0026rsquo;re not converting the value, we\u0026rsquo;re reading the same bits with a different interpretation. Like reading a French word as if it were English: the letters are the same, but you interpret them differently.\nLine 2: (bits \u0026gt;\u0026gt; 23) \u0026amp; 0xFF shifts right by 23 positions to move the exponent bits (bits 23-30) down to bits 0-7, then masks with 0xFF to keep only those 8 bits. Subtracting 127 removes the bias. Result: floor(log2(max_abs)).\nLine 3: Subtract 2. This is the key step. We want scale ≈ amax / 6.0, and we\u0026rsquo;re working in powers of 2. The FP4 E2M1 format can represent values up to 6.0. 
How many powers of 2 does it take to reach 6? 2^2 = 4 is the largest power of 2 at or below 6. So the FP4 format \u0026ldquo;covers\u0026rdquo; 2 powers of 2 on its own.\nIf your input needs 4 powers of 2 to represent (say amax ≈ 25, so floor(log2(25)) = 4), and FP4 already covers 2, then the scale needs to cover the remaining 4 - 2 = 2 powers of 2. So scale = 2^2 = 4.\nLine 4: Add 127 to put the result back into the biased E8M0 format for storage.\nFull trace for amax = 0.945:\nmax_exp = floor(log2(0.945)) = -1\nscale_exp = -1 - 2 = -3\nbiased = -3 + 127 = 124\nscale = 2^(124 - 127) = 2^(-3) = 0.125\nCheck: 0.945 / 0.125 = 7.56, clamped to FP4 max (6.0)\nDequantized: 6.0 x 0.125 = 0.75 (error: 0.195 on the max value)\nFull trace for amax = 25.0:\nmax_exp = floor(log2(25.0)) = 4\nscale_exp = 4 - 2 = 2\nbiased = 2 + 127 = 129\nscale = 2^(129 - 127) = 2^2 = 4.0\nCheck: 25.0 / 4.0 = 6.25, clamped to FP4 max (6.0)\nDequantized: 6.0 x 4.0 = 24.0 (error: 1.0 on the max value)\nThe wrong turns Getting the scale right took four attempts. Each failure taught me something about how the spec actually works.\nAttempt 1: frexpf The C standard library provides a function called frexpf. It takes a float and splits it into a mantissa and an exponent, similar to what the float\u0026rsquo;s bits encode internally. For example, frexpf(0.1576) returns m = 0.6304 and exp = -2, such that 0.6304 x 2^(-2) = 0.1576.\nI used it to compute floor(log2(amax / 6.0)) directly. The idea seemed clean: call frexpf, get the exponent, done.\nThe problem: frexpf defines its mantissa as being between 0.5 and 1.0, not between 1.0 and 2.0 like the float\u0026rsquo;s internal representation. This means frexpf\u0026rsquo;s exponent is always one higher than floor(log2(x)). For 0.1576: frexpf gives exp = -2, but floor(log2(0.1576)) = -3 (because 2^(-3) = 0.125 ≤ 0.1576 \u0026lt; 0.25 = 2^(-2)). 
You need to subtract 1 from frexpf\u0026rsquo;s exponent to get the floor.\nI did that, but juggling two different conventions (frexpf\u0026rsquo;s [0.5, 1.0) vs the float\u0026rsquo;s internal [1.0, 2.0)) made the code confusing. And that confusion led directly to the next mistake.\nResult: correct output, but fragile code.\nAttempt 2: accidental ceil While modifying the frexpf version, I accidentally removed the exp -= 1 adjustment. This changed the scale from floor to ceil, making it one power of 2 too large.\nWith the correct scale 0.125 (floor): 0.945 / 0.125 = 7.56, clamped to 6.0. With the wrong scale 0.25 (ceil): 0.945 / 0.25 = 3.78, rounds to FP4 value 4.0.\nEvery value in the block was divided by a scale twice as large as needed, so all the FP4 nibbles came out wrong. The test failed completely. Every single byte was different from the expected output.\nResult: complete failure. One power of 2 off changes everything.\nAttempt 3: the safety check I went back to the floor version (with exp -= 1) but added a safety net. After computing the scale, I checked: does amax / scale exceed 6.0? If so, bump the scale up by one power of 2.\nFor amax = 0.945, scale = 0.125: 0.945 / 0.125 = 7.56 \u0026gt; 6.0. So my code bumped the scale to 0.25. Stored value: 125 instead of 124.\nBut TorchAO (the reference that Tensara verifies against, as explained in section 4) expects 124. It uses floor and accepts the clamping. My \u0026ldquo;safety\u0026rdquo; produced a different scale, which changed every output byte.\nThis was the most frustrating attempt because preventing overflow feels like the right engineering instinct. But the spec deliberately allows overflow. The clamping is a feature, not a bug.\nResult: wrong answer. 
Scale was 125 everywhere, expected was 124.\nAttempt 4: direct bit extraction I abandoned frexpf entirely and extracted the exponent directly from the float\u0026rsquo;s binary representation:\nunsigned int bits = __float_as_uint(max_abs);\nint max_exp = (int)((bits \u0026gt;\u0026gt; 23) \u0026amp; 0xFF) - 127;\nNo ambiguity about mantissa conventions. No off-by-one adjustments. The float\u0026rsquo;s exponent bits are literally floor(log2) with a bias. Extract and subtract.\nResult: correct scale on all test cases. Time to move on to the encoder.\n7. The Encoder: Turning a Float into 4 Bits The scale is computed and validated. We divide each value by the scale. The result is a float somewhere in the range that FP4 can represent. For example, if the original value is 0.6 and the scale is 0.125, we get 0.6 / 0.125 = 4.8.\nNow we need to choose which of the 8 FP4 magnitudes (0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0) is the closest to 4.8. In this case it\u0026rsquo;s 4.0 (nibble 6) or 6.0 (nibble 7). The midpoint between 4.0 and 6.0 is 5.0. Since 4.8 \u0026lt; 5.0, we pick 4.0. The nibble is 6.\nThat\u0026rsquo;s the encoder: for each scaled value, find the nearest FP4 value.\nThe midpoints between consecutive FP4 values are the decision boundaries:\nFP4 values: 0 0.5 1.0 1.5 2.0 3.0 4.0 6.0\nNibble: 0 1 2 3 4 5 6 7\nMidpoints: 0.25 0.75 1.25 1.75 2.5 3.5 5.0\nIf a value falls between two midpoints, the nearest FP4 value is unambiguous. 
The code checks each midpoint from top to bottom:\nif (abs_val \u0026gt; 5.0) nibble = 7; // represents 6.0\nelse if (abs_val \u0026gt;= 3.5) nibble = 6; // represents 4.0\nelse if (abs_val \u0026gt; 2.5) nibble = 5; // represents 3.0\nelse if (abs_val \u0026gt;= 1.75) nibble = 4; // represents 2.0\nelse if (abs_val \u0026gt; 1.25) nibble = 3; // represents 1.5\nelse if (abs_val \u0026gt;= 0.75) nibble = 2; // represents 1.0\nelse if (abs_val \u0026gt; 0.25) nibble = 1; // represents 0.5\nelse nibble = 0; // represents 0.0\nThe \u0026gt;= vs \u0026gt; puzzle Look carefully: some thresholds use \u0026gt; (strictly greater) and some use \u0026gt;= (greater or equal). For most values, it doesn\u0026rsquo;t matter. If your value is 2.7, it\u0026rsquo;s between midpoints 2.5 and 3.5, and both \u0026gt; and \u0026gt;= give the same answer: nibble 5 (represents 3.0).\nThe difference only shows up when a value lands exactly on a midpoint. When that happens, the value is equidistant from two FP4 values. You have to pick one.\nThe MXFP4 spec uses round-to-nearest-even: when there\u0026rsquo;s a tie, pick the nibble with an even index. This is the same rule that every computer uses for regular floating point arithmetic. It exists to prevent statistical bias: always rounding up (or always down) would systematically shift your values in one direction over large datasets.\nLet\u0026rsquo;s work through one midpoint in detail to see how this translates to code.\nTake the midpoint 0.75. It\u0026rsquo;s the point exactly between the FP4 value 0.5 (nibble 1) and the FP4 value 1.0 (nibble 2). If your scaled value is 0.74, it\u0026rsquo;s closer to 0.5, you pick nibble 1. If it\u0026rsquo;s 0.76, it\u0026rsquo;s closer to 1.0, you pick nibble 2. No problem.\nBut if it\u0026rsquo;s exactly 0.75, it\u0026rsquo;s at equal distance from both. Nibble 1 is odd, nibble 2 is even. Round-to-nearest-even says: pick nibble 2 (the even one). 
In the code, this means the threshold 0.75 must use \u0026gt;= so that the exact value 0.75 enters the nibble = 2 branch.\nNow take the midpoint 1.25. It\u0026rsquo;s between nibble 2 (even) and nibble 3 (odd). Round-to-nearest-even says: pick nibble 2 (the even one). This time we want the exact value 1.25 to NOT enter the nibble = 3 branch, so we use \u0026gt; instead of \u0026gt;=.\nThe rule for each midpoint:\nMidpoint Below Above Even pick Operator Example: exact midpoint becomes\n0.25 nibble 0 (even) nibble 1 (odd) 0 \u0026gt; 0.25 becomes nibble 0 (value 0.0)\n0.75 nibble 1 (odd) nibble 2 (even) 2 \u0026gt;= 0.75 becomes nibble 2 (value 1.0)\n1.25 nibble 2 (even) nibble 3 (odd) 2 \u0026gt; 1.25 becomes nibble 2 (value 1.0)\n1.75 nibble 3 (odd) nibble 4 (even) 4 \u0026gt;= 1.75 becomes nibble 4 (value 2.0)\n2.5 nibble 4 (even) nibble 5 (odd) 4 \u0026gt; 2.5 becomes nibble 4 (value 2.0)\n3.5 nibble 5 (odd) nibble 6 (even) 6 \u0026gt;= 3.5 becomes nibble 6 (value 4.0)\n5.0 nibble 6 (even) nibble 7 (odd) 6 \u0026gt; 5.0 becomes nibble 6 (value 4.0)\nThe pattern: \u0026gt;= when the upper nibble is even, \u0026gt; when it\u0026rsquo;s odd.\nThe debugging journey This detail took the longest to get right. The progression of errors shows how small the gap is between \u0026ldquo;almost correct\u0026rdquo; and \u0026ldquo;correct\u0026rdquo;.\nEncoder v1: \u0026gt;= everywhere My first instinct: use \u0026gt;= at every threshold. Clean, uniform, easy to read.\nTensara result: wrong answer, maximum difference 0.375. At this stage I was still debugging the scale at the same time, so this error came from both issues combined. 
But it told me the rounding was wrong on at least some boundary values.\nEncoder v2: lookup table with strict \u0026lt; I replaced the threshold checks with a brute-force approach: store all 8 FP4 values in an array, compute the distance from the input to each one, and pick the closest.\nconst float fp4_vals[8] = {0.f, 0.5f, 1.f, 1.5f, 2.f, 3.f, 4.f, 6.f};\nuint8_t best = 0;\nfloat best_dist = x;\nfor (int i = 1; i \u0026lt; 8; i++) { float dist = fabsf(x - fp4_vals[i]); if (dist \u0026lt; best_dist) { best_dist = dist; best = i; } }\nWhen two values are equally close (a tie), the \u0026lt; means the earlier candidate (lower nibble) wins.\nTensara result: wrong answer, maximum difference 0.0625. Much better, from 0.375 down to 0.0625. The remaining error was exactly one FP4 step (0.5) times the scale (0.125). One nibble was off by one position on a tie. The \u0026lt; tie-break always picks the lower nibble, but round-to-nearest-even sometimes wants the upper nibble (when the upper one has an even index).\nEncoder v3: round-to-nearest-even I went back to the threshold approach but analyzed each midpoint individually using the table above. Every tie goes to the even nibble: some thresholds use \u0026gt;=, others use \u0026gt;.\nTensara result: accepted, maximum difference 0.0. Seven characters changed from v1 (some \u0026gt;= became \u0026gt; and vice versa). But those seven characters are the difference between matching the spec and not.\n8. Packing: Two Values per Byte The third and last step. A nibble is 4 bits. A byte is 8 bits. We pack two FP4 values into one byte:\nbyte = (nibble_odd \u0026lt;\u0026lt; 4) | nibble_even;\nThe even-indexed element goes in the low 4 bits (bits 0-3), the odd-indexed element goes in the high 4 bits (bits 4-7). For a block of 32 FP4 values, that\u0026rsquo;s 16 packed bytes.\nThe packing order matters. If you swap even and odd, the dequantization produces garbage. 
On my SM120 kernel, I use a different packing format (one FP4 in an 8-bit container with 4 bits of padding). For the MXFP4 OCP standard, it\u0026rsquo;s two FP4 per byte with no padding.\n9. The First Kernel: One Thread per Block Now we need to write the actual GPU code. The three steps above (scale, encode, pack) describe what to do for one block of 32 elements. But a real matrix has over a million blocks. We need to run them all in parallel.\nI knew from the start that the first kernel would not be the fastest. The goal of a first implementation is never performance, it\u0026rsquo;s correctness. A simple kernel where each thread does everything alone is easier to debug: if the output is wrong, the bug is in the algorithm (scale or encoder), not in the parallelism. Once all tests pass, we know the logic is correct and we can optimize the thread-to-data mapping without worrying about confusing an algorithm bug with a parallelism bug.\nMapping the problem to the GPU The matrix has M rows and K columns. Each row is split into blocks of 32 elements. The total number of blocks is M x (K / 32). For a 4096 x 8192 matrix, that\u0026rsquo;s 4096 x 256 = 1,048,576 blocks. Each block is completely independent: it reads its own 32 values, computes its own scale, and produces its own 16 output bytes (32 FP4 values packed 2 per byte). No block needs to communicate with any other.\nThis maps naturally to a GPU. We launch one thread per block:\nint total_blocks = m * (k / 32); // 1,048,576 for a 4096x8192 matrix\nint threads_per_cuda_block = 256;\nint grid = (total_blocks + 255) / 256;\nkernel\u0026lt;\u0026lt;\u0026lt;grid, threads_per_cuda_block\u0026gt;\u0026gt;\u0026gt;(...);\nA note on naming: \u0026ldquo;block\u0026rdquo; means two different things here. In the quantization context, a \u0026ldquo;block\u0026rdquo; is 32 data elements that share a scale. 
In the CUDA context, a \u0026ldquo;block\u0026rdquo; is a group of threads that are assigned to the same processor (SM) on the GPU, share a fast local memory (shared memory), and can synchronize with each other. I\u0026rsquo;ll say \u0026ldquo;data block\u0026rdquo; for the 32 elements and \u0026ldquo;CUDA block\u0026rdquo; for the thread group.\nInside the kernel, each thread figures out which data block it\u0026rsquo;s responsible for:\nint bid = blockIdx.x * blockDim.x + threadIdx.x;\nint row = bid / num_blocks_per_row;\nint col_block = bid % num_blocks_per_row;\nint col_start = col_block * 32;\nThen the thread runs all three steps of the pipeline alone:\nReads 32 float values from global memory (128 bytes, one load at a time)\nLoops through all 32 to find the max absolute value, computes the scale\nLoops through all 32 again, divides each by the scale, encodes to FP4\nPacks pairs into bytes and writes 16 bytes of output + 1 byte of scale\nThis works. All four test cases passed. But the benchmarks told a clear story:\nMatrix size My kernel #1 (Triton)\n1024 x 1024 23 μs 65 μs\n2048 x 2048 49 μs 55 μs\n4096 x 8192 285 μs 89 μs\n8192 x 4096 282 μs 98 μs\nOn small matrices, I was faster. On large matrices, 3x slower. The arithmetic is the same regardless of size. What changes is the memory access pattern.\nWhy it\u0026rsquo;s slow on large matrices The bottleneck on large matrices is memory access. GPUs execute threads in groups of 32 called warps. When all 32 threads in a warp read from consecutive addresses, the memory controller merges everything into a single 128-byte transaction. This is called coalesced access.\nIn my naive kernel, each thread handles a different data block. Thread 0 reads from column 0, thread 1 reads from column 32, thread 2 from column 64. These threads are in the same warp, but their addresses are 128 bytes apart (32 floats x 4 bytes). The memory controller can\u0026rsquo;t merge those into one transaction. 
Each read is served separately.\nOn small matrices, the cache absorbs the penalty. On large matrices, every scattered read hits global memory at full latency (hundreds of cycles), and performance collapses. 10. The Optimized Kernel: One Warp per Block The algorithm is correct. All tests pass. Now we change nothing about the logic. We only change how the work is distributed across threads so that memory access matches what the hardware can do efficiently.\nThe fix comes from noticing that a warp has exactly 32 threads, and a data block has exactly 32 elements. What if we assign one warp to one data block, where each thread handles exactly one element?\nint warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32; int lane = threadIdx.x \u0026amp; 31; float val = a[row * k + col_start + lane]; Thread 0 reads element 0, thread 1 reads element 1, thread 2 reads element 2, all the way to thread 31 reading element 31. All 32 addresses are consecutive in memory (4 bytes apart). The memory controller combines them into a single 128-byte transaction. One memory request instead of 32.\nWarp shuffle for the max In the naive kernel, one thread loops over 32 values to find the max. In the warp version, each thread holds one value. We need to find the max across all 32 threads without using shared memory.\nWarp shuffles let threads exchange values directly through their registers. The instruction __shfl_down_sync(mask, val, offset) sends each thread\u0026rsquo;s value to the thread offset positions below:\nfloat max_abs = fabsf(val); for (int offset = 16; offset \u0026gt; 0; offset \u0026gt;\u0026gt;= 1) max_abs = fmaxf(max_abs, __shfl_down_sync(0xFFFFFFFF, max_abs, offset)); max_abs = __shfl_sync(0xFFFFFFFF, max_abs, 0); First iteration (offset = 16): thread 0 receives thread 16\u0026rsquo;s value and keeps the max of the two. Thread 1 receives thread 17\u0026rsquo;s value. 
After this step, threads 0-15 each hold the max of a pair.\nSecond iteration (offset = 8): thread 0 receives thread 8\u0026rsquo;s value (which is already the max of a pair). Now threads 0-7 each hold the max of four values.\nAfter 5 iterations (log2(32) = 5), thread 0 holds the max of all 32 values. The final __shfl_sync broadcasts thread 0\u0026rsquo;s result to all 32 threads so they all know the scale.\nTotal cost: 5 register-to-register operations. No memory access, no synchronization barriers.\nWarp shuffle for packing Each even-numbered thread needs its odd neighbor\u0026rsquo;s nibble to pack two FP4 values into one byte:\nuint8_t partner_nibble = __shfl_xor_sync(0xFFFFFFFF, nibble, 1);\nThis swaps values between adjacent pairs: thread 0 gets thread 1\u0026rsquo;s nibble, thread 2 gets thread 3\u0026rsquo;s. Then only even threads write the packed byte:\nif ((lane \u0026amp; 1) == 0) { uint8_t byte = (partner_nibble \u0026lt;\u0026lt; 4) | nibble; q[warp_id * 16 + lane / 2] = byte; }\n16 threads write 16 consecutive bytes. Coalesced again.\nThe result\nMatrix size Naive Warp #1 (Triton)\n1024 x 1024 23 μs 32 μs 65 μs\n2048 x 2048 49 μs 39 μs 55 μs\n4096 x 8192 285 μs 201 μs 89 μs\n8192 x 4096 282 μs 290 μs 98 μs\nThe warp version is slower on 1024x1024. For that matrix the naive kernel launches 32,768 threads (one per data block, 1024 rows x 32 blocks per row). The warp kernel launches 1,048,576 threads (32 per data block). The extra threads have overhead: register allocation, warp scheduling. When the computation per data block is tiny, this overhead matters. On large matrices, the coalesced memory access more than compensates.\nThe final optimization was replacing the lookup-table encoder (a loop of 8 distance comparisons, each calling fabsf) with the direct threshold encoder (7 branches, no loop). This reduced the instruction count per thread and brought the overall average down.\nFinal result across all test cases: 72.48 μs. First place, ahead of the Triton kernel at 75.12 μs.\n11. 
What I Learned

The spec matters more than the code. I spent more time understanding how TorchAO computes the scale and rounds the values than writing the actual CUDA kernel. A one-bit difference in the scale exponent changes every single output value.

Floor vs ceil is a design choice, not a bug. The OCP spec deliberately uses floor for the scale, accepting that some values overflow and get clamped. This maximizes precision for the majority of values at the cost of clipping the extremes. My instinct was to prevent overflow, which was wrong.

Round-to-nearest-even is everywhere. I knew this rule from standard floating point arithmetic but didn't expect it in a 4-bit format with only 16 values. The difference between >= and > at each threshold is invisible in 99.999% of cases, but Tensara's verification caught it.

Coalesced memory access is often the single biggest optimization. Going from "one thread reads 32 values" to "32 threads each read one value" was the difference between 285 μs and 72 μs on large matrices. The arithmetic was identical. Only the memory access pattern changed.

Correctness first, always. The naive kernel was not the fastest, but it was the easiest to debug. Every optimization I applied afterward changed zero lines of algorithm code. The scale, encoder, and packing logic stayed identical. Only the thread-to-data mapping changed.

There is still room to go faster. Vectorized loads (reading 4 floats at once with float4), better occupancy tuning, and register pressure optimization are all on the table.
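The coalescing lesson can be made concrete with a toy host-side model. The `transactions` helper below is invented for this post; it only counts how many distinct 128-byte segments the 32 lane addresses of one warp-wide load touch, which is a reasonable first-order model of how the memory controller batches the request:

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Toy coalescing model: given the byte address each of the 32 lanes reads,
// count the distinct 128-byte segments touched. Each distinct segment costs
// (to first order) one memory transaction. Illustrative only.
int transactions(const std::vector<uint64_t>& lane_addrs) {
    std::set<uint64_t> segments;
    for (uint64_t a : lane_addrs) segments.insert(a / 128);
    return static_cast<int>(segments.size());
}
```

With the warp kernel's pattern (lane i reads float i, addresses 0, 4, 8, ... 124), all 32 reads land in one segment: one transaction. With the naive pattern (lane i starts its own 32-float block, addresses 0, 128, 256, ...), every lane lands in its own segment: 32 transactions for the same amount of data.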
And the same approach applies to the next problems on the Tensara leaderboard: MXFP4 GEMM, NVFP4 quantization, and MXFP8 quantization.

I Wrote an MXFP4 Quantization Kernel and Ranked #1 on Tensara (https://florianmattana.com/posts/mxfp4_article/)

1. Why FP4 Fused Attention on Consumer Blackwell?

The attention mechanism in transformers scales quadratically with sequence length. On a consumer GPU with 12 GB of VRAM and 672 GB/s of memory bandwidth, that becomes a hard wall very quickly. The interesting thing about the RTX 5070 Ti (SM120, 46 SMs) is the raw throughput the Tensor Cores can deliver:

Precision | Throughput
FP16 | 123.5 TFLOPS
INT8 | 246.9 TFLOPS
FP4 | ~474 TFLOPS

That is roughly a 4x advantage going from FP16 to FP4, and since FP4 values are four times smaller, you also move four times less data through memory. On paper, that is a massive win for attention. If you can actually use the FP4 Tensor Cores.

The ecosystem support for FP4 on consumer Blackwell is recent and still thin. SageAttention3 does support SM120 and achieves over 1000 TOPS on the RTX 5090. But it is built on CUTLASS templates, which makes it very difficult to understand what happens between the data load and the Tensor Core instruction.
If you want to know exactly how bytes are packed into MMA registers, how scale factors are distributed across lanes, or why a specific shared memory layout was chosen, the CUTLASS abstraction does not help you. The same is true for the emerging FlashInfer and vLLM backends that are adding SM120 paths.

There are also non-fused FP4 kernels for this hardware. For example, VincentKaufmann's fp4-cuda-kernel reaches about 143 TFLOPS. But non-fused means you compute QxK, write the full NxN score matrix to VRAM, read it back, apply softmax, write again, then compute the attention output. For 4096 tokens, that score matrix alone is 64 MB. On a 12 GB card, that is a dealbreaker.

The whole point of a fused kernel is to keep the intermediate score matrix in registers and never write it to global memory. That is what FlashAttention does for FP16, and I wanted to do the same thing for FP4.

This article documents the full process of building that kernel from scratch using inline PTX assembly on the RTX 5070 Ti. The goal is not to compete with SageAttention3 on throughput. It is to make every step of the FP4 fused attention pipeline visible and understandable: the MMA instruction, the fragment layout, the quantization, the scale factors, the softmax, the profiling. Most of this is undocumented for SM120 and had to be figured out empirically.

2. Choosing the Programming Model

I considered three approaches:

Option A - Inline PTX. Write the kernel in CUDA C++ and embed the Tensor Core MMA instructions as inline PTX assembly. This gives full control over register allocation, meaning I can guarantee the score matrix stays in registers.

Option B - CuTe (CUTLASS 3.x). Use NVIDIA's template library. CuTe is powerful, but it abstracts away register placement. I was not confident I could prevent it from spilling the score matrix to shared or global memory, especially for a non-standard fused pattern.

Option C - Patch an existing INT8 kernel.
Take a working fused INT8 attention kernel and swap the MMA instructions for FP4 equivalents. Faster to prototype, but brittle: the register layouts differ between INT8 and FP4 MMA, so the whole data flow would need reworking anyway.

I went with Option A. The trade-off is clear: more manual work, more room for bugs, but absolute certainty about where every value lives. For a fused kernel where the entire point is keeping data in registers, that certainty is worth it.

3. What the Kernel Needs to Do

Before writing a single line of code, it helps to map out the full problem. A fused FP4 attention kernel has to solve five things in sequence, and each one has a hardware dependency that constrains everything that follows.

Load the input tiles. Q, K, and V are too large to fit in registers. They live in global memory and must be loaded into shared memory tile by tile. The size of each tile is bounded by the shared memory budget per SM, which on SM120 turns out to be 99 KiB and not the 128 KiB I initially assumed.

Quantize on the fly. The Tensor Core does not consume float32. Before running the matrix multiply, each tile must be converted to FP4 E2M1 and its block scale factors must be computed. This has to happen in shared memory, which means a two-pass approach: load the tile as float32 first, compute the scale, then encode.

Run the matrix multiply in FP4. The core operation is S = Q times K-transpose, computed with FP4 Tensor Cores. The score matrix S is 64 times 64 floats and must never be written to global memory. It lives entirely in registers across the warp throughout the computation. A warp is a group of 32 threads that execute together on the GPU, the fundamental unit of execution on Tensor Cores.

Apply online softmax. Softmax over a full row of S requires knowing the row maximum. But in the MMA output layout, each row is distributed across four threads.
That forces a cross-thread reduction before every softmax step, using warp shuffle instructions.

Accumulate the output. The final output O is computed as softmax(S) times V. This second matrix multiply accumulates incrementally as each column tile of S is processed, again never materializing the full score matrix.

Each of these steps depends on knowing exactly which MMA instruction is available on SM120, what register layout it expects, and what quantization format it accepts. That is what the next section is about.

4. Picking the Right MMA Instruction

This is where I hit the first major wall. I started by reading the PTX ISA docs looking for FP4 MMA instructions on Blackwell. The datacenter SM100 chips use tcgen05.mma, a new-generation instruction that operates on large tiles and uses a dedicated hardware unit called Tensor Memory. I assumed SM120 would have something similar.

It does not.

After digging through CUTLASS issue #2800, a thread on the NVIDIA developer forums, and CUTLASS issue #3044, I pieced together the reality: SM120 uses the older Ampere-style warp-level mma.sync instructions. No Tensor Memory, no tcgen05. The specific instruction I need is:

mma.sync.aligned.kind::mxf8f6f4.block_scale.scale_vec::1X.m16n8k32.row.col.f32.e2m1.e2m1.f32.ue8m0

Let me unpack that:

mma.sync.aligned - warp-synchronous, all 32 threads participate.
kind::mxf8f6f4 - the MX (microscaling) family that covers FP4/FP6/FP8.
block_scale.scale_vec::1X - each group of 32 FP4 values shares one 8-bit scale factor. I initially tried scale_vec::2X (one scale per 16 values, finer granularity) but it does not compile on SM120. Only 1X is supported, which means 6.25% overhead for the scale factors.
m16n8k32 - tile shape: 16 rows x 8 columns, with K=32 (32 FP4 values along the reduction dimension per instruction).
f32.e2m1.e2m1.f32 - FP32 accumulators, FP4 E2M1 inputs for both A and B matrices.
ue8m0 - the scale factor format (unsigned 8-bit exponent, no mantissa, i.e., powers of two only).

The register budget for one MMA call came out to roughly 7 registers per thread by my first count: 2 for the A fragment, 1 for B, and 4 for the FP32 accumulator. This assumption turned out to be wrong. The correct count is 10: 4 registers for A, 2 for B, and 4 for the accumulator. Discovering that cost several weeks of debugging, and section 11 explains how.

The fused kernel will process tiles of Q, K, V through shared memory, roughly 9 KB by my initial estimate. That number turned out to be wrong in two ways. First, the actual shared memory usage ended up closer to 49 KiB once the full quantization pipeline was in place. Second, the available budget on SM120 is not 128 KB as I'd assumed from the general Blackwell documentation, but 99 KiB. I found this while browsing open CUTLASS issues: issue #3144, titled "StageCountAutoCarveout assumes max family SMEM, breaks SM121 (99 KiB vs SM120 228 KiB)", reported a bug where CUTLASS was incorrectly assuming all SM12x GPUs share the same shared memory size. A contributor clarified in the thread that SM120 consumer Blackwell has 99 KiB, while SM100 datacenter Blackwell has 228 KiB. The distinction matters: at 49 KiB, the kernel fits within the 99 KiB opt-in budget, but only just. On any new GPU, checking the actual shared memory limit (cudaDeviceGetAttribute with cudaDevAttrMaxSharedMemoryPerBlockOptin) before making assumptions about the budget would have saved me time. Section 8 covers the shared memory layout in detail.

5. Testing the MMA Instruction (and Everything That Went Wrong)

Before building the full fused kernel, I needed to verify that a single FP4 MMA instruction actually works.
The idea is simple: load known values into registers A and B, run the MMA, and check that the FP32 accumulators contain the expected result.

I wrote a minimal warp-synchronous kernel, launched with <<<1, 32>>> so that exactly one warp executes. The kernel fills the A and B registers with constant FP4 values, calls the MMA via inline PTX, and prints the four accumulator floats from thread 0.

The tile shape surprise

My first attempt used m16n8k64: I reasoned that since FP4 values are 4 bits each, 64 of them would fit in 32 bytes (the same as 32 FP8 values). The PTX assembler disagreed. It turns out the correct shape for FP4 on SM120 is m16n8k32: the k-dimension counts 8-bit containers, not individual FP4 values. Each container holds one FP4 nibble in bits 5-2, padded with zeros. This means you are effectively wasting half the container, but that is what the hardware expects.

The encoding bug that cost me a full day

FP4 E2M1 encodes the value 1.0 as the 4-bit pattern 0b0010. The container is an 8-bit byte, and the nibble must sit in bits 5-2, not bits 3-0. That means the correct byte for 1.0 is 0x08: the pattern 0b00001000, the nibble shifted left by two. If you place the nibble in bits 3-0 instead, you get 0x02, which the hardware reads as a completely different value.

I initially filled every register with 0x22222222, four bytes of 0x22 packed together. I thought I was encoding 2.0 in every position. What I was actually doing was placing the nibble in the wrong bit positions: the hardware read each byte as 0b00100010 and extracted the nibble from bits 5-2, which gives 0b1000, not the encoding for 2.0 at all. The MMA returned 32.0 where I was expecting 128.0 (32 times 2.0 times 2.0 with scale 1.0).

After staring at bit layouts for longer than I would like to admit, I realized the nibble was in the wrong position.
Switching to 0x08080808, which places the 1.0 nibble correctly in bits 5-2 of each byte, and setting the scale to 1.0, the MMA returned 32.0 exactly. That is 32 multiply-accumulates of 1.0 times 1.0. Correct.

The lesson: the FP4 container format is 00_SEMM_00, where the nibble occupies bits 5 through 2. Get the shift wrong and the hardware silently reads a different value with no error.

The inline PTX

There is a reason this instruction appears as raw inline assembly rather than a clean C++ wrapper. The CUDA Core Compute Libraries (CCCL) expose cuda::ptx wrappers for many PTX instructions, which would normally be the right abstraction to use here. But at the time of writing, cuda::ptx does not provide wrappers for warp-level mma.sync on SM120. I exchanged with Federico Busato, who maintains CCCL at NVIDIA, on this exact gap. His read was that the wrappers would be useful but the decision was pending. I opened CCCL issue #8146 to track it. In the meantime, inline PTX is the only path. Here is the asm volatile block as I first wrote it:

asm volatile(
    "mma.sync.aligned.m16n8k32.row.col.kind::mxf8f6f4"
    ".block_scale.scale_vec::1X.f32.e2m1.e2m1.f32.ue8m0"
    " {%0, %1, %2, %3},"
    " {%4, %5},"
    " {%6},"
    " {%7, %8, %9, %10},"
    " %11,"
    " %12;"
    : "=f"(d0), "=f"(d1), "=f"(d2), "=f"(d3)
    : "r"(a0), "r"(a1), "r"(b0),
      "f"(c0), "f"(c1), "f"(c2), "f"(c3),
      "r"(scale_a), "r"(scale_b)
);

This version has two registers for A and one for B. It is wrong. The correct instruction requires four A registers and two B registers. This assumption cost several weeks of debugging and is corrected in section 11.
The block above is shown as written at this stage because it compiled and passed the isolated MMA test described below. The test was not thorough enough to catch the error.

The "=f" constraints are FP32 output registers, "r" are 32-bit integer input registers. The accumulator C is passed through as input (initialized to zero for the first call), and the result lands in D. The scale registers each pack four UE8M0 bytes into a single uint32.

With a0 = a1 = 0x08080808 (all 1.0), b0 = 0x08080808 (all 1.0), and scales set to 1.0 (0x7F7F7F7F, since UE8M0 byte 127 = 2^0 = 1.0), the result was 32.0 in every accumulator lane. That is 32 multiply-accumulates of 1.0 x 1.0, which is exactly right.

6. Encoding FP32 to FP4 E2M1

Once the MMA worked with hardcoded constants, the next step was encoding arbitrary float values into FP4 E2M1 at runtime. The encoding function is covered in detail in a dedicated post on my MXFP4 quantization kernel. What follows here is a summary of the key points as they apply to this kernel. If you are already familiar with my previous article or already have experience with FP4 quantization, you can skip this section.

The FP4 E2M1 format

FP4 has 1 sign bit, 2 exponent bits (bias 1), and 1 mantissa bit. That gives you exactly 16 representable values:

Binary | Value
0000 | +0.0
0001 | +0.5
0010 | +1.0
0011 | +1.5
0100 | +2.0
0101 | +3.0
0110 | +4.0
0111 | +6.0
1000 | -0.0
1001 | -0.5
1010 | -1.0
1011 | -1.5
1100 | -2.0
1101 | -3.0
1110 | -4.0
1111 | -6.0

The maximum representable magnitude is 6.0. Anything larger saturates.

The encoding function

The device function takes a float, determines the closest FP4 magnitude through a chain of comparisons, assembles the 4-bit nibble, and shifts it left by 2 to place it in bits 5-2 of the 8-bit container:

__device__ uint8_t encode_fp4_e2m1(float val) {
    uint8_t sign = (val < 0.0f) ? 1 : 0;
    float abs_val = fabsf(val);
    uint8_t encoded;
    if (abs_val >= 5.0f) encoded = 0x07;        // 6.0
    else if (abs_val >= 3.5f) encoded = 0x06;   // 4.0
    else if (abs_val >= 2.5f) encoded = 0x05;   // 3.0
    else if (abs_val >= 1.75f) encoded = 0x04;  // 2.0
    else if (abs_val >= 1.25f) encoded = 0x03;  // 1.5
    else if (abs_val >= 0.75f) encoded = 0x02;  // 1.0
    else if (abs_val >= 0.25f) encoded = 0x01;  // 0.5
    else encoded = 0x00;                        // 0.0
    uint8_t nibble = (sign << 3) | encoded;
    return nibble << 2; // place in bits 5-2
}

Quick sanity checks: encode_fp4_e2m1(1.0f) returns 0x08, encode_fp4_e2m1(-1.0f) returns 0x28, encode_fp4_e2m1(6.0f) returns 0x1C. To pack four encoded bytes into a uint32_t for the MMA register:

uint32_t pack = e0 | (e1 << 8) | (e2 << 16) | (e3 << 24);

End-to-end test: encode 1.0 into every position, pack, run MMA -> 32.0. Encode 2.0 -> 128.0. Both matched. The encoding function was correct.

7. Block Scaling: Why the Encoding Function Is Not Enough

Here is the problem I ran into immediately: FP4 E2M1 maxes out at 6.0. If your input values are larger, and in attention they absolutely will be, everything above 5.0 clamps to 6.0 and you lose all relative differences. For example, encode_fp4_e2m1(12.0f) and encode_fp4_e2m1(10.0f) both return 0x1C (6.0). That is catastrophic for attention scores, where the relative ordering is everything.

The solution: block scaling

The MX format handles this with a shared scale factor per block. Before encoding, you divide every value in a block by a common scale factor, bringing them into the representable FP4 range. The MMA hardware then multiplies the scale back in during the accumulation for free, no extra instructions.

Take a block of values: {12.0, 10.0, 3.0, -7.0}. The maximum absolute value is 12.0. If I choose a scale of 16, dividing gives {0.75, 0.625, 0.1875, -0.4375}.
Now every value fits in FP4 range and, critically, they encode to different FP4 values, preserving the relative ordering.

Why the scale must be a power of two

The scale factor is stored in UE8M0 format: an 8-bit unsigned exponent with no mantissa. The actual scale value is 2^(byte - 127). This means only powers of two are representable. That is not a limitation; it is a feature. Multiplying by a power of two is just a shift of the exponent of a floating-point number, so the Tensor Core applies the scale with zero additional cost.

Choosing the right scale

You want the smallest power of two that is greater than or equal to the maximum absolute value in the block. Rounding up avoids overflow (values that would exceed 6.0 after division). Rounding down would risk saturation for the largest values, exactly the ones you most need to preserve.

The formula: find max_abs across the block, compute exponent = ceil(log2(max_abs)), and the UE8M0 byte is exponent + 127.

Example: block maximum is 12.0. ceil(log2(12.0)) = ceil(3.58) = 4. Scale = 2^4 = 16. UE8M0 byte = 4 + 127 = 131 = 0x83.

The device function

__device__ uint8_t compute_scale_ue8m0(float* block, int size) {
    float max_abs = 0.0f;
    for (int i = 0; i < size; i++) {
        float a = fabsf(block[i]);
        if (a > max_abs) max_abs = a;
    }
    if (max_abs == 0.0f) return 127; // scale = 1.0
    int exponent = (int)ceilf(log2f(max_abs));
    return (uint8_t)(exponent + 127);
}

Validation

I set up a test: A = 8.0 everywhere, B = 1.0 everywhere. Scale for A: max_abs = 8.0, ceil(log2(8)) = 3, UE8M0 byte = 130 = 0x82 (scale = 8). Scale for B: max_abs = 1.0, ceil(log2(1)) = 0, UE8M0 byte = 127 = 0x7F (scale = 1). After dividing A by 8, every element becomes 1.0, encoded as 0x08. The MMA computes 32 x (1.0 x 1.0) = 32.0, then the hardware applies the scales: 32.0 x 8 x 1 = 256.0.

The kernel printed 256.0.
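Both device functions are easy to cross-check without a GPU. The sketch below mirrors them on the host (the `_host` suffix names are mine, not the kernel's; fabsf/fmaxf become their std:: equivalents):

```cpp
#include <cmath>
#include <cstdint>

// Host-side mirror of the section 6 encoder: threshold chain picks the
// nearest FP4 magnitude, then the nibble is shifted into bits 5-2.
uint8_t encode_fp4_e2m1_host(float val) {
    uint8_t sign = (val < 0.0f) ? 1 : 0;
    float abs_val = std::fabs(val);
    uint8_t encoded;
    if      (abs_val >= 5.0f)  encoded = 0x07; // 6.0
    else if (abs_val >= 3.5f)  encoded = 0x06; // 4.0
    else if (abs_val >= 2.5f)  encoded = 0x05; // 3.0
    else if (abs_val >= 1.75f) encoded = 0x04; // 2.0
    else if (abs_val >= 1.25f) encoded = 0x03; // 1.5
    else if (abs_val >= 0.75f) encoded = 0x02; // 1.0
    else if (abs_val >= 0.25f) encoded = 0x01; // 0.5
    else                       encoded = 0x00; // 0.0
    return (uint8_t)(((sign << 3) | encoded) << 2); // nibble into bits 5-2
}

// Host-side mirror of compute_scale_ue8m0: UE8M0 byte = ceil(log2(max)) + 127.
uint8_t compute_scale_ue8m0_host(const float* block, int size) {
    float max_abs = 0.0f;
    for (int i = 0; i < size; i++)
        max_abs = std::fmax(max_abs, std::fabs(block[i]));
    if (max_abs == 0.0f) return 127; // scale = 2^0 = 1.0
    int exponent = (int)std::ceil(std::log2(max_abs));
    return (uint8_t)(exponent + 127);
}
```

Running the section's own numbers through it: for the block {12.0, 10.0, 3.0, -7.0} the scale byte is 131 (scale 16). Without scaling, 12.0 and 10.0 both encode as 0x1C; after dividing by 16 they encode as 0x08 (1.0) and 0x04 (0.5), so the ordering survives.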
The block scaling pipeline works end to end.

The trade-off

On SM120, each scale factor covers a block of 32 elements along the K dimension. If one element is an outlier, say 100.0 in a block where everything else is around 1.0, the scale gets set to 128 and all the small values round to zero after division. Smaller blocks would preserve more detail, but scale_vec::1X is all we get on this hardware. It is a real limitation, and for attention (where softmax creates sharp distributions), it matters. I will revisit this when profiling the full kernel.

The kernel now has three validated building blocks: encode_fp4_e2m1, compute_scale_ue8m0, and the inline PTX MMA call. The next step is loading Q, K, V tiles into shared memory and wiring everything together into the fused attention loop.

8. Assembling the Kernel: Where Things Got Real

The three building blocks worked in isolation. Encoding, scaling, MMA, all validated with hardcoded test values. Now I had to wire them together into an actual kernel that loads real data from VRAM, quantizes it on the fly, and runs the Tensor Core multiply. This is where the gap between "I have working pieces" and "I have a working kernel" became very real.

The first decision: how much shared memory

My first attempt allocated two separate FP32 buffers in shared memory, one for Q and one for K. Each tile is 64 tokens times 128 dimensions = 8192 floats = 32 KB. Two of them: 64 KB. Add the quantized buffers and scales, and I was over 80 KB, eating most of the 99 KiB opt-in budget established in section 3.

The fix was simple: Q and K are never needed in FP32 at the same time. Load Q as FP32, quantize it into Q_quant, then reuse the same staging buffer for K. One FP32 buffer instead of two.
That brought the total down to about 49 KiB.

Buffer | Type | Size | Purpose
staging | float | 32,768 B | Reusable FP32 buffer for loading from VRAM
Q_quant | uint8 | 8,192 B | Q tile after FP4 quantization
K_quant | uint8 | 8,192 B | K tile after FP4 quantization
Q_scales | uint8 | 256 B | One UE8M0 scale per 32 Q elements
K_scales | uint8 | 256 B | One UE8M0 scale per 32 K elements

Still only one block per SM due to register pressure from the accumulators, but the shared memory budget is now accounted for.

The loading pattern that almost tripped me up

The kernel uses 128 threads per block: 4 warps of 32 threads each, where each warp is responsible for 16 rows of the output tile. That gives exactly 4 times 16 = 64 rows, matching the Q tile size. The thread count is a direct consequence of the MMA tile geometry, not an arbitrary choice.

Those 128 threads need to load 8192 floats from VRAM into shared memory. The standard approach is a strided loop: each thread starts at its own index and jumps by 128 each iteration. Thread 0 loads elements 0, 128, 256, and so on. Thread 1 loads 1, 129, 257. This guarantees that on every iteration, consecutive threads read consecutive addresses, which is the definition of coalesced access. A non-coalesced load would serialize the memory transactions and cost significant bandwidth.

I initially wrote this with row and column indexing, computing row = idx / Bd and col = idx % Bd and then calculating the global address from there. It worked, but it was unnecessary complexity. Since Q is row-major and the tile is a contiguous block of rows, the linear index maps directly:

for (int k = 0; k < TILE_SIZE; k += NUM_THREADS) {
    int idx = tid + k;
    int g_idx = blockIdx.x * TILE_SIZE + idx;
    staging[idx] = Q[g_idx];
}
__syncthreads();

blockIdx.x * TILE_SIZE is the offset for this block's group of 64 tokens. No division, no modulo.
Sometimes the clever approach is the dumb one.

The quantization two-pass problem

I wanted to quantize the floats as I loaded them from VRAM, avoiding the staging buffer entirely. But block scaling killed that idea. To compute the scale factor for a block of 32 elements, you need the maximum absolute value across all 32. During the strided load, a single thread does not see all 32 elements of the same scaling block because they are spread across multiple iterations. Thread 0 sees elements 0, 128, 256, never elements 1 through 31.

So quantization has to be a separate pass after loading. The staging buffer exists specifically because of this dependency. First load everything as FP32, barrier, then quantize in a second pass where each thread handles complete 32-element blocks:

for (int i = tid; i < NUM_SCALE_BLOCKS; i += NUM_THREADS) {
    uint8_t scale = compute_scale_ue8m0(&staging[i * BLOCK_ELEMENT], BLOCK_ELEMENT);
    Q_scales[i] = scale;
    float scale_f = exp2f((float)(scale - 127));
    for (int j = 0; j < BLOCK_ELEMENT; j++) {
        float val = staging[i * BLOCK_ELEMENT + j] / scale_f;
        Q_quant[i * BLOCK_ELEMENT + j] = encode_fp4_e2m1(val);
    }
}
__syncthreads();

256 scaling blocks, 128 threads, 2 blocks per thread. Each thread processes its blocks sequentially, scanning for the max, computing the scale, dividing, and encoding all 32 values. It is not fast (there is a lot of branching in encode_fp4_e2m1, and the inner loop is purely sequential), but it works. Optimization comes later.

K gets the exact same treatment: load into staging (overwriting Q's FP32 data, which is already quantized and safe in Q_quant), barrier, quantize into K_quant and K_scales, barrier.

The MMA fragment loading: the part I could not figure out alone

This is where I spent the most time. The MMA instruction expects each of the 32 threads in a warp to hold specific bytes from the A and B matrices in its registers.
Not just any bytes, but the exact bytes that correspond to that thread's position in the matrix tile. The mapping is defined in the PTX ISA fragment layout tables.

For fragment A, a [16 x 32] slice of Q, each thread holds registers that pack FP4 values in their 8-bit containers. The lane ID determines which row and which K-column range the thread is responsible for. The registers are not contiguous in memory: they cover two different K-column ranges separated by a stride of 16 columns.

For fragment B, the K matrix accessed as its transpose, each thread holds registers covering one token and one range of head-dimension positions. The scale factors follow the same lane-based assignment: each thread looks up the scale for the block it loaded and passes it to the MMA.

What I thought I understood at this stage turned out to be wrong in almost every detail. The register count, the lane grouping, the K stride, the scale index: all of it had to be corrected later through empirical testing. The code I had at this point compiled and passed isolated tests with hardcoded values, which masked the errors. Sections 11 through 14 document what was wrong and how it was fixed.

The MMA loop

With fragments loaded, the MMA itself is straightforward. The outer loop covers 8 column tiles of S, the inner loop accumulates 4 K-chunks along the head dimension:

for (int n_tile = 0; n_tile < N_TILES; n_tile++) {
    float acc[ACC_PER_THREAD] = {0.f};
    for (int k_tile = 0; k_tile < K_TILES; k_tile++) {
        // load fragments and scales
        // call asm volatile MMA
    }
}

The accumulators are both input and output. On the first K-chunk they are zero. Each subsequent MMA adds its partial product. After 4 iterations, the accumulators hold the complete dot products for this thread's slice of the output tile. Four warps, 8 column tiles each, 4 accumulators per thread: the full [64 x 64] score matrix lives entirely in registers.
No global memory is touched for the intermediate result. That is the whole point of the fused kernel.

The code shown here is intentionally simplified. The correct register count, lane assignments, and index formulas are established in sections 11 through 14, after the correctness failures described in the next two sections.

9. The First Correctness Test

The result

Section 8 ended with the MMA loop assembled and the score matrix living in registers. I ran the correctness test against a float32 CPU reference.

Cosine similarity: 0.06.

Cosine similarity measures the angle between two output vectors. A value of 1.0 means the GPU and CPU outputs point in exactly the same direction. A value of 0 means they are completely uncorrelated. At 0.06, the GPU was producing noise.

What the number does not tell you

The three validated components were not the problem. The encoding function worked. The block scaling worked. The isolated MMA test still printed 256.0 correctly. The problem had to be in how the full kernel assembled those pieces together.

The most obvious candidate was the fragment loading. In section 5, I described loading a0 and a1 for the A fragment and b0 for B. That description was my best understanding at the time. It was wrong in two ways I had not yet discovered.

The issue with cosine as a diagnostic is that it gives you one number. It tells you whether the final answer is right or wrong. It tells you nothing about where it went wrong.

10. The Fragment Layout Problem

No documentation

The fragment layout, the mapping between lane IDs and matrix positions, is defined by the hardware for every MMA instruction variant: the hardware dictates precisely which matrix positions belong to which thread.

For FP16 and BF16 MMA variants, NVIDIA documents these layouts with diagrams in the PTX ISA specification. For mxf8f6f4 m16n8k32 on SM120, those diagrams do not exist.
The instruction is listed, the operand counts are given, but the per-lane mapping is absent.

I posted a question on the NVIDIA developer forums on March 19. No response.

The guessing phase

So I started guessing. The PTX ISA gives you the operand count for the instruction: 4 registers for A, 2 for B, 4 for the accumulator. But it does not tell you which matrix row or column each register corresponds to for a given lane. That mapping is what I was guessing.

A lane is a thread within a warp, numbered 0 to 31. For a 16x32 tile of Q spread across 32 threads, each lane owns a specific slice. The question is: which slice?

I tried lane % 16 for the row index and lane / 16 for the K-column group. I tried different strides between registers: 4, 8, 16 columns apart. Each attempt compiled without errors and produced a different cosine between 0.01 and 0.09. A cosine of 0.01 and a cosine of 0.09 both mean wrong. There was no gradient to follow. I could not tell if I was getting closer to the correct layout or further away, because the only signal I had was a single float that told me nothing about where the error was.

The problem was not the guesses. The problem was the test.

11. Finding the Right Diagnostic

The identity matrix test

I replaced the attention test with a simpler one designed to give precise information. I set Q and K to the 64x64 identity matrix, zero-padded to the [64, 128] shape the kernel expects. The expected output is S = Q times K-transpose = I_64. Every diagonal entry is 1.0, every off-diagonal entry is 0.

When you multiply a matrix by its own transpose, the result at position [i][j] is the dot product of row i with row j. For the identity matrix, row i contains a single 1.0 at position i and zeros everywhere else. The dot product of row i with row j is 1.0 only when i equals j, and 0 in every other case.
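That counting argument is easy to verify with a tiny host reference. The helper `count_nonzeros_identity` is written for this post, not part of the kernel's test harness; it computes S = Q x K-transpose for Q = K = I and counts the non-zeros, which must equal the matrix dimension:

```cpp
// Host reference for the identity-matrix diagnostic: for Q = K = I_n, the
// score S[i][j] is the dot product of row i with row j, which is 1.0 only
// on the diagonal. The non-zero count must therefore be exactly n.
int count_nonzeros_identity(int n) {
    int nonzeros = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float dot = 0.0f;
            for (int k = 0; k < n; k++) {
                float q  = (i == k) ? 1.0f : 0.0f; // row i of Q (identity)
                float kt = (j == k) ? 1.0f : 0.0f; // row j of K (identity)
                dot += q * kt;
            }
            if (dot != 0.0f) nonzeros++;
        }
    return nonzeros;
}
```

Any count other than n, or any non-zero off the diagonal, points at a specific wrong [i][j] and therefore at specific misloaded lanes.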
That is why the result is the identity matrix again.
I also isolated the fragment loading from everything else by writing a separate debug kernel that loads directly from global memory, skips shared memory entirely, and hardcodes all scales to 1.0. One variable at a time.
Running this with my best guess at the fragment layout: 20 non-zeros instead of 8. Columns 0 through 7 of the identity matrix should produce exactly 8 ones on the diagonal. There were 12 phantom values in wrong positions.
What it reveals This test has two properties the attention test does not. First, 1.0 is exactly representable in FP4 E2M1, so quantization cannot explain wrong results. Second, each non-zero in S comes from exactly one dot product. If S[2][5] is non-zero when it should be zero, it means the threads responsible for row 2 of Q and column 5 of K loaded data they were not supposed to load. The wrong value points directly to the wrong lane.
For the first time, I could see where the problem was.
12. Fixing the A Fragment The wrong register count We are inside the k_tile loop, at the moment where each thread loads its slice of Q into registers before calling the MMA. This is the fragment load: the step that distributes the A matrix across the 32 threads of the warp so the Tensor Core can consume it. The A tile here is a [16 x 32] slice of Q: 16 token rows, each spanning 32 elements along the head dimension. 512 values in total, divided exactly across 32 threads. Each thread holds 4 registers of 4 FP4 elements each: 32 × 4 × 4 = 512. The distribution is exact. There is no slack.
My original code loaded two registers per thread. That covered only half the A tile. The other 256 elements had no owner. The hardware read those positions from whatever happened to be in the registers at that point, uninitialized memory, and computed dot products against it. No error was raised.
The kernel ran to completion and wrote output that looked like real attention scores.
The wrong lane grouping The register count was not the only problem. Even with the right number of registers, the data inside them was wrong. The K-column each thread loads depends on its lane ID. My formula used lane / 16, which divides the 32 lanes into two groups of 16. Every thread in the first group loaded the same K-column positions, and every thread in the second group loaded another identical set. Two groups of 16 threads duplicating each other’s work means only two distinct K-column ranges were ever read. The other half of the A tile, the K-columns that no group was assigned to, was never touched.
The correct formula is lane % 4, which creates four groups of eight threads. Each group gets a different K-column subgroup, and together the four groups cover all 32 K-columns of the A tile without overlap or gap.
The fix With lane / 4 for the row and lane % 4 for the K-column subgroup, each of the 32 threads gets a unique assignment: 8 row positions times 4 K-column groups. No two threads duplicate each other’s work and no position in the A tile goes unloaded. The four registers follow a pattern that is worth making explicit because it is easy to get wrong. The first two registers, a0 and a1, cover the same K-column range but different rows: a0 for row0, a1 for row0+8. The second two registers, a2 and a3, shift the K-columns by 16 and repeat the same row pattern: a2 for row0, a3 for row0+8. The row alternates across registers rather than incrementing, which is the opposite of what feels natural.
After this fix: 8 non-zeros, correct diagonal positions for columns 0 through 7.
A different fragment, the same mistake.
B is the K matrix, accessed as its transpose by the MMA .col modifier. Each thread loads a slice of K corresponding to one token and one range of head-dimension positions. My original code used lane / 16 to select the token and lane % 16 for the head-dimension offset. Two groups of 16 lanes, every lane within a group duplicating the same load. Only two distinct tokens were ever loaded per tile. The rest of K was uninitialized memory.
The fix The correction follows the same logic as A. lane / 4 selects the token within the column tile, giving 8 distinct token assignments. lane % 4 selects the head-dimension subgroup, giving 4 distinct K-column ranges. The two B registers, b0 and b1, cover the same token at K-column positions 16 apart, mirroring the stride pattern from the A fragment.
Confirmation After this fix: 64 non-zeros, S equals I_64 exactly.
A second test with Q and K filled from {-2, -1, 0, 1, 2}, values all exactly representable in FP4 E2M1, gave cosine similarity 1.000000 and max absolute error 0.000 across all 4096 elements of S. The fragment loading was correct.
The identity matrix test took less than an afternoon to build. The previous weeks of guessing produced nothing because I had no way to see where the errors were. Once I could observe which specific cells in S were wrong, both fixes took less than an hour.
14. The Remaining Bugs With the fragment layout validated on clean data, I moved the corrected loading code into the full kernel that uses shared memory and quantization. The cosine dropped back to 0.06. A different set of problems, same symptom.
The K stride The fragment loading reads from K_quant, the quantized K tile in shared memory. K is stored row-major: K_quant[token * Bd + head_dim], where Bd is 128. My indexing used token * BQ instead, treating BQ (64) as the stride between tokens. Every K access was reading from roughly half the correct address in shared memory.
The kernel was computing dot products between Q rows and bytes from the wrong tokens.
Fix: replace BQ with Bd in the K stride.
Cosine after fix: still 0.06. The next bug was masking any improvement.
The K scale index Each 32-element block of K has a scale factor stored in K_scales. The index into that array is token * (Bd / BLOCK_ELEMENT) + k_tile. The stride between tokens is Bd / BLOCK_ELEMENT, which is 4. My code used BQ / BLOCK_ELEMENT instead, which is 2. Every scale lookup was landing at the wrong position, applying an incorrect normalization factor to the K values before the MMA.
Fix: replace BQ with Bd in the scale stride.
Cosine after fix: still 0.06. Two more bugs waiting.
The scope errors q_row0 holds the output row index for this thread. It was declared inside the k_tile loop but needed by the output write that happens after both loops. With the variable out of scope at the write, the code silently picked up whatever value happened to be on the stack, writing output to arbitrary row positions.
out_col was declared twice: once inside the n_tile loop and once in the output write. The second declaration shadowed the first. Both used the same formula so the values were identical, but the shadowing masked a structural problem that would have caused bugs if the formula had ever changed.
Both variables needed to be declared once before both loops.
Cosine after fix: still 0.06. One more.
The lane collision The output write used q_row0 = warp_id * MMA_M + (lane % 16). With lane % 16, lane 0 and lane 16 both compute q_row0 = 0 and write to the same memory address. Lane 1 and lane 17 both write to row 1, lanes 2 and 18 to row 2, and so on. Every row was being written twice by two different threads, one overwriting the other.
The correct formula is lane / 4, which gives each row a unique group of four lanes. No two threads write to the same address.
Cosine after fix: 0.19.
The output was wrong but no longer completely random.
The V accumulation With S computed correctly in registers, the kernel accumulates the attention output as O = softmax(S) times V. Each thread is responsible for two output columns. For each column tile of S, it needs to sum the softmax-weighted V rows across all eight K-tokens in that tile.
The problem: each thread only holds the softmax weights for two of those eight tokens. The other six are in neighboring threads. My first implementation used the butterfly reduce to collect the V contributions, the same pattern used earlier for the row maximum. That was the wrong operation here.
The butterfly adds values from neighboring threads. Thread 0 accumulates for output column 0. Thread 1 accumulates for output column 2. After the butterfly, thread 0 was adding its column-0 contribution to thread 1’s column-2 contribution. Different output dimensions mixed together. The result had no meaning.
The correct approach is different: each thread uses __shfl_sync to fetch the softmax weights from its three neighbors explicitly, then multiplies each neighbor’s weights against the V values for its own output columns. The accumulation stays local to each thread’s assigned dimensions. No cross-dimension mixing.
Cosine after fix: 0.81.
The race condition The kernel was launched with four blocks of 128 threads each. One block already covers the full 64-row Q tile: four warps times 16 rows each. Four blocks meant four independent groups of threads all writing to the same output array simultaneously, with the last writer winning arbitrarily.
One block is enough.
Cosine after fix: 0.81 confirmed. The remaining gap from 1.0 is quantization noise, not a correctness issue. Section 16 confirms this.
15. The Scale Layout The same problem The MMA instruction takes one uint32_t per thread for scale_a and one for scale_b. But the A tile has 16 rows and each row needs its own scale factor.
The B tile has 8 columns and each column needs its own scale. One register per thread cannot hold all of that.
The hardware must be reading specific bytes from specific lanes, distributing the scale responsibility across the warp just as it distributes the fragment data. But the PTX ISA does not document this mapping for mxf8f6f4 m16n8k32 on SM120.
The same situation as the fragment layout. The same approach.
The probing method I filled Q and K with all 1.0 and set all scales to 127, which is 2^(127-127) = 1.0. With all inputs equal to 1.0 and scale equal to 1.0, every MMA output is 32.0: the sum of 32 multiplications of 1.0 times 1.0.
Then I ran the kernel 32 times. Each run set exactly one lane’s scale_a to 128, which is 2^(128-127) = 2.0, while all other lanes kept 127. If lane L’s scale register controls row R of the output, then row R doubles from 32.0 to 64.0. I recorded which rows changed for each target lane.
The results
Lane condition | Row affected
lane % 4 == 0 | lane / 4, rows 0 through 7
lane % 4 == 1 | lane / 4 + 8, rows 8 through 15
lane % 4 == 2 | no effect
lane % 4 == 3 | no effect
The same probing on scale_b: only lanes where lane % 4 == 0 have any effect, and lane L controls column lane / 4.
This is consistent with the A fragment structure established in section 12. The register a0 covers row0 and its scale is supplied by the lane with lane % 4 == 0. The register a1 covers row0+8 and its scale comes from lane % 4 == 1. Lanes with lane % 4 == 2 or 3 load fragment data normally through their A registers, but the hardware does not read their scale values. They are silently ignored.
This also resolved a secondary issue from the original code. The kernel had been packing the scale byte four times into the uint32_t: sa | (sa << 8) | (sa << 16) | (sa << 24). This happened to work because the hardware reads byte 0 of the register, and packing the same byte four times keeps byte 0 correct.
Once the actual mapping was clear, the cast became a direct (uint32_t)sa. Same result, explicit intent.
16. Online Softmax and the Accumulation of V Why a warp reduction is unavoidable With correct fragment loading, scale indexing, and output addressing, the kernel was computing S = Q times K-transpose correctly in registers. The second half of the fused attention is Out = softmax(S) times V.
Writing S to global memory, running softmax separately, and reading it back would defeat the entire purpose of the fused kernel. Instead, I used the online softmax algorithm introduced in the Flash Attention paper. The idea is to maintain a running state that updates as each column tile of S is computed, so the softmax normalization is applied incrementally without ever materializing the full score matrix.
The D output layout from the MMA places a row’s scores across four threads. The eight scores for a complete row are split: thread 0 holds columns 0 and 1, thread 1 holds columns 2 and 3, thread 2 holds columns 4 and 5, thread 3 holds columns 6 and 7. To compute the row maximum needed for numerically stable softmax, those four threads must communicate.
A butterfly reduce is a communication pattern where threads exchange values in log2(N) rounds, such that after the rounds every thread holds the result of the operation across all N threads. Running __shfl_xor_sync twice with masks 1 and 2 covers all four pairings in two rounds. After these two rounds, all four threads in the group hold the global maximum of their row. This warp reduction is not a choice. It is a direct consequence of the fragment layout.
The online state The running state for each row has three components: m, the maximum score seen so far; l, the sum of exponentials seen so far; and O, the running output accumulated so far.
When a new tile arrives:
new_m = max(m, tile_max)
alpha = exp(m - new_m)
new_l = alpha * l + sum(exp(score - new_m) for score in tile)
new_O = (alpha * l * O + weighted V contribution) / new_l
alpha is the rescaling factor. If the new tile contains a score larger than anything seen before, alpha is less than 1 and the previous accumulator shrinks proportionally. If the maximum does not change, alpha is 1 and the old output is unchanged.
The V accumulation bug The V accumulation bug and its fix are described in section 14. After the fix, the cosine with random inputs reached 0.81.
The race condition The kernel was launched with four blocks and 128 threads per block. Four blocks meant four independent groups of four warps, all writing their outputs to the same memory addresses with no coordination. Each block computed correct results and then overwrote what the previous block had written. One block is enough. Changing <<<4, NUM_THREADS>>> to <<<1, NUM_THREADS>>> fixed it.
Validation At this point every correctness bug had been fixed. The V accumulation and race condition bugs are documented in section 14 — they were resolved before the softmax was wired in. With those fixes in place, the full pipeline from Q and K loading to softmax to V accumulation produced:
cosine similarity : 1.0000 PASS
ref[0..7] : -0.446 -0.879 -0.450 0.511 0.940 0.968 -0.947 -0.049
out[0..7] : -0.446 -0.879 -0.450 0.511 0.940 0.968 -0.947 -0.049
Bit-exact. This confirms that the entire pipeline is correct: FP4 quantization, block scaling, MMA fragment loading, online softmax, and V accumulation all produce the right answer when given inputs that are exactly representable in FP4 E2M1.
The 0.81 cosine observed earlier with random inputs in [-1, 1] is the intrinsic precision cost of MXFP4 at scale_vec::1X granularity. FP4 E2M1 has only eight representable magnitudes.
With one scale covering 32 elements, a single outlier sets the scale for the entire block and the remaining values lose resolution. The CPU reference operates on the original float32 values, so the comparison is unfair. The kernel is correct. The 0.81 is an architectural constraint, not a bug.
17. The K Tile Loop The kernel validated in section 16 had one hard limitation: K was loaded from a hardcoded offset, meaning it only ever saw the first 64 tokens of the key sequence. For any real attention computation, K can have thousands of tokens. The kernel needed a loop.
The change is conceptually simple. Instead of loading one K tile before the MMA loop, the outer structure becomes a loop over sequence tiles. For each tile, the kernel loads 64 rows of K into shared memory, quantizes them, runs the full MMA and softmax update, then moves to the next tile. The online softmax state (m, l, and O) is declared before the loop and persists across all tiles. Each tile’s scores are folded into the running state via the rescaling factor alpha.
One detail worth noting: the V access index must account for the tile offset. When accumulating the attention output, the V row index is not just the local token position within the tile but seq_tile * BQ + local_token. Without that offset, every tile reads from the beginning of V regardless of which K tokens it just scored.
A __syncthreads() at the end of each iteration ensures the staging buffer is free before the next tile’s load overwrites it.
Validation with two test cases confirms correctness. With seq_k = 64 (single tile, regression test): cosine 1.0000. With seq_k = 128 (two tiles, first real test of the loop): cosine 1.0000.
The kernel now processes attention over arbitrary key sequence lengths, in multiples of 64 tokens.
18. Softmax Scaling The attention formula is softmax(Q×Kᵀ / sqrt(d)) × V.
The division by sqrt(d) was missing from the kernel until this point.
Without it, the scores in S grow with the head dimension. Each score is a dot product of two vectors of length d. If Q and K have values around 1, the scores are on the order of sqrt(d); for d=128, that is around 11. Feeding large values into softmax pushes it toward saturation: the maximum score gets a weight close to 1 and everything else collapses toward 0. The attention output becomes a near-copy of the V row corresponding to the single highest score, losing all the nuance of the weighted average.
Dividing by sqrt(d) brings the scores back to order of magnitude 1 before the softmax, keeping the output distribution balanced.
In the kernel, this is a single multiply applied to the four accumulators immediately after the inner k_tile loop and before the softmax reduction:
for (int i = 0; i < ACC_PER_THREAD; i++) acc[i] *= softmax_scale; // softmax_scale = 1/sqrt(Bd) = 1/sqrt(128)
The CPU reference was updated to apply the same scaling, and both test cases pass at cosine 1.0000.
19. Multi-Head Attention and Arbitrary Head Dimensions The limitation Up to this point, the kernel had three hard constraints that made it unusable on real models. The head dimension was fixed at 128. The output covered only 8 columns per thread, which happened to match the previous test setup but was wrong for any real head dimension. And the kernel processed a single head with no concept of batch or head index.
Template parameter for head dimension Different models use different head dimensions. GPT-2 uses 64, LLaMA uses 128, some recent models use 256. Hardcoding 128 excludes most of them.
The solution is a C++ template parameter. Instead of a fixed constant, the kernel becomes template<int HEAD_DIM>.
The compiler generates a separate binary for each instantiation: fused_fp4_attention<128> and fused_fp4_attention<64> are two distinct kernels, each with their own compile-time constants for tile sizes, loop bounds, and register counts. No runtime branching, no overhead.
The only constraint is that HEAD_DIM must be a multiple of 32, which is the MMA reduction dimension. Values like 64, 96, 128, 160, and 256 all work.
The output accumulator bug Fixing the head dimension revealed a deeper bug. The original kernel kept two scalar accumulators per thread for the V output: O0_c0 and O0_c1. That was correct when the output had 8 columns total, but wrong for any real head dimension.
For HEAD_DIM=128, each thread is responsible for 128 / 4 = 32 output columns, not 2. The previous kernel was writing 2 values and leaving the rest of the columns at zero.
The fix replaces the two scalars with an array O0[V_COL_TILES * 2] where V_COL_TILES = HEAD_DIM / MMA_N. For HEAD_DIM=128 that is 32 floats per row per thread. The V accumulation becomes a loop over all output column tiles, and the online softmax rescaling (alpha) must be applied to every element of that array at each update step.
Multi-head and batching Each block processes one (batch, head) pair independently. The mapping is:
int batch_idx = blockIdx.x / heads;
int head_idx = blockIdx.x % heads;
Each block computes its own offset into the Q, K, V, and Out tensors and works without any coordination with other blocks.
The launch becomes <<<batch * heads, NUM_THREADS>>>.
Validation Six test cases confirm correctness across configurations:
Config | Result
head_dim=128, seq_k=64, 1 head | cosine 1.0000 PASS
head_dim=128, seq_k=128, 1 head | cosine 1.0000 PASS
head_dim=64, seq_k=64, 1 head | cosine 1.0000 PASS
head_dim=64, seq_k=128, 1 head | cosine 1.0000 PASS
head_dim=128, seq_k=128, batch=1 heads=4 | cosine 1.0000 PASS
head_dim=128, seq_k=128, batch=2 heads=4 | cosine 1.0000 PASS
The kernel now handles arbitrary head dimensions, multiple heads, and batched inputs.
20. First Benchmark and the NCU Diagnosis The numbers With the kernel functionally complete, I ran a first benchmark on the RTX 5070 Ti. Configuration: batch=1, heads=32, seq_q=64.
head_dim | seq_k | kern_ms | TFLOPS | BW GB/s
128 | 128 | 0.072 | 1.87 | 87.8
128 | 512 | 0.255 | 2.11 | 74.0
128 | 1024 | 0.530 | 2.03 | 67.3
64 | 128 | 0.037 | 1.82 | 85.2
64 | 512 | 0.102 | 2.62 | 92.1
64 | 1024 | 0.192 | 2.80 | 92.9
The RTX 5070 Ti has a theoretical FP4 throughput of 474 TFLOPS. We are at 2.8, which is 0.6% utilization. Before optimizing, I needed to know exactly where the time was going.
What NCU said The first metric that jumped out was “No Eligible” at 95.23%. This means the warp scheduler found no warp ready to execute 95% of the time. The GPU was spending almost all its time waiting.
A warp scheduler looks for eligible warps — warps that have their input data ready and can execute an instruction. When none are eligible, the SM is idle. High “No Eligible” is the definition of a latency-bound kernel.
The occupancy was 7.94% against a theoretical maximum of 8.33%. The reason:
Block Limit Shared Mem : 2 blocks per SM
Block Limit Registers : 8 blocks per SM
The shared memory was the binding constraint. With 41 KB of static shared memory per block and 99 KiB available per SM, only 2 blocks could fit per SM simultaneously. With 2 blocks of 2 warps each, the SM had 4 active warps.
A GPU needs roughly 32 warps per SM to fully hide memory latencies.
The long scoreboard stall was at 81%. Warps were stalling on data from global memory. The culprit was V: the V accumulation loop was accessing V directly from global memory inside a double loop, generating thousands of uncoalesced accesses per thread.
The fix: V in shared memory The solution was to load each V tile into shared memory before the MMA loop, exactly as K is loaded. This adds 32 KB to the shared memory budget for HEAD_DIM=128, bringing the total to about 80 KB. The double buffering for K was removed to fit within the 99 KiB limit.
After this change, the TFLOPS roughly doubled:
head_dim | seq_k | before | after
128 | 1024 | 1.26 TFLOPS | 2.03 TFLOPS
64 | 1024 | 1.41 TFLOPS | 2.80 TFLOPS
The bandwidth on the kernel side went from ~50 GB/s to ~92 GB/s on the best configurations, confirming that the V global memory accesses were the dominant stall.
What NCU says now After loading V into shared memory, the new binding constraint is occupancy. With 80 KB of shared memory per block, only 1 block can fit per SM instead of 2. Combined with 128 registers per thread, driven by the large O0 accumulator array, the theoretical occupancy stays at 8.33%.
The kernel is still latency-bound, but for a different reason. The shared memory budget is now the wall. Reducing it requires either a smaller tile size or a different strategy for the output accumulator. That is the next step.
Across 8 n_tiles and multiple seq_tiles, that amounted to thousands of scattered reads per block. The long scoreboard stall confirmed it at 81%.\nThe fix was straightforward: load each V tile into shared memory before the MMA loop, exactly as K is already loaded. This eliminated the global memory dependency during accumulation. The result was roughly a 2x improvement in TFLOPS, and the kernel bandwidth jumped from ~50 GB/s to ~92 GB/s on the best configurations.\nThat introduced a new constraint. Adding a 32 KB V tile to the shared memory budget brought the total to 80 KB per block. With 99 KiB available per SM on SM120, only one block could fit at a time. The occupancy stayed at 8.33% with four active warps per SM instead of the ~32 needed to hide latencies.\nThe next move was to reduce the shared memory footprint. The staging buffer, used to load K as float32 before quantization, was the largest single consumer at 32 KB. By switching it from float32 to __half, the footprint dropped to 16 KB. The same buffer then gets reused for V, eliminating the need for a separate V tile buffer.\nThe precision trade-off is negligible. FP16 has 10 bits of mantissa. FP4 E2M1 has 1. Any rounding introduced by the float32 → float16 → float32 conversion disappears completely in the FP4 quantization step. The test suite confirmed cosine 1.0000 on all configurations after the change.\nThe shared memory budget dropped to ~32.5 KB, which should allow 3 blocks per SM instead of 1. The TFLOPS improved further, reaching 3.1 on head_dim=64, seq_k=1024.\nA comparison against PyTorch SDPA on identical configurations put the gap in perspective. PyTorch FP16 reaches 15 TFLOPS on the same hardware for the same problem size. We are at 3 TFLOPS, roughly 5x slower.\nThe gap is not algorithmic. It is architectural. PyTorch SDPA via FlashAttention receives data already in FP16. No quantization happens inside the kernel. Our kernel quantizes Q, K, and V on the fly at every call, inside the main loop. 
The quantization pass, encode_fp4_e2m1 called 8192 times per tile with a chain of eight comparisons per call, takes roughly as long as the MMA itself. The Tensor Cores are idle most of the time, waiting for the quantization pass to finish.
This is the fundamental tension of the current design. On-the-fly quantization keeps the interface simple: the kernel accepts float32 inputs just like any standard attention kernel. But it means the FP4 Tensor Cores are not the bottleneck; the scalar quantization loop is. Closing the gap with PyTorch requires either vectorizing the quantization, or moving it outside the kernel entirely and accepting pre-quantized inputs. Both directions are worth exploring, and that is where the next round of optimization begins.
22. Deep NCU Analysis: What the SASS Revealed Section 21 identified the core problem: on-the-fly quantization dominates the kernel runtime. But saying “quantization is slow” is not actionable. I needed to see exactly which instructions were responsible and how much they cost. That meant reading the SASS, the actual machine code the GPU executes.
I ran ncu --set full on the kernel and exported the source-level report. The kernel compiled to about 5,900 SASS instructions. Four of them were QMMA, the FP4 Tensor Core multiply. The other 5,896 were overhead.
The division that was not a division The first thing that stood out was 129 calls to a function called __cuda_sm3x_div_rn_noftz_f32_slowpath. Each call appeared as a real CALL.REL.NOINC instruction in the SASS, with register save/restore and dozens of instructions per invocation.
The GPU does not have a hardware division unit. What it has is MUFU.RCP, an instruction that computes the reciprocal 1/b on the Special Function Unit pipe in about one cycle. The result is accurate to roughly 22 bits of mantissa. A float32 has 23.
That last bit matters to the IEEE-754 standard, so by default nvcc generates a software routine that starts with MUFU.RCP, then refines the result with two or three rounds of Newton-Raphson. That routine is the \u0026ldquo;slowpath\u0026rdquo;. It is a full function call, not an inlined instruction.\nIn my kernel, the division happened in compute_scale_ue8m0. After finding the maximum absolute value in a block of 32 elements, the code divided each value by the scale factor before encoding to FP4. The scale is a power of two (UE8M0 format), so the division is exact in floating point. IEEE-754 precision on a result that will be rounded to one of eight magnitudes.\nThe fix was to replace the division with a multiply by the inverse: exp2f((float)(127 - scale)) computes 2^(-exponent) exactly, and multiplying is a single FMUL instruction. After the change, the 129 CALL instructions disappeared entirely. The kernel dropped from 5,900 to 4,200 SASS instructions, a 28% reduction.\nI later found a thread on the NVIDIA developer forums where njuffa, a former NVIDIA engineer who designed FPU hardware, confirmed the behavior. The \u0026ldquo;sm3x\u0026rdquo; in the function name is misleading: the routine was written for Kepler (compute capability 3.x) around 2012 and has been reused on every architecture since, including Blackwell. The user in that thread measured a 3x speedup by replacing / with __fdividef.\nThe 647 comparisons for a 32-element max The second problem was more subtle. compute_scale_ue8m0 finds the maximum absolute value across 32 elements using a sequential loop: if (fabsf(block[i]) \u0026gt; max_abs) max_abs = fabsf(block[i]). The compiler translated each iteration into two SASS instructions: FSETP.GT (compare, write predicate) followed by FSEL (conditional select). Two instructions per comparison, 31 comparisons per block, hundreds of blocks per tile. 
The total was 647 FSETP/FSEL instructions in the kernel.\nThe GPU has an instruction that does this in one step: FMNMX. It takes two values and a predicate that selects min or max, and produces the result in a single cycle on the same ALU pipe. The compiler was not using it because the C++ code was written as an if statement, which the optimizer does not always reduce to FMNMX. Replacing if (a \u0026gt; max_abs) max_abs = a with max_abs = fmaxf(fabsf(block[i]), max_abs) gives nvcc the hint it needs. The fmaxf intrinsic maps directly to FMNMX.\nThis cut the instruction count for the scale computation roughly in half. The effect on total kernel time was smaller than the division fix, because the FSETP/FSEL chain does not stall the pipeline as badly as a function call, but it reduced pressure on the ALU pipe that was already saturated by the quantization work.\nThe byte-by-byte problem The quantization pipeline writes each encoded FP4 byte to shared memory individually with STS.U8, an 8-bit store. The kernel had 66 of them. Each STS.U8 consumes a full shared memory transaction (32 bytes of bandwidth) to write a single byte. Packing four bytes into a uint32_t and writing once with STS.32 would use the same bandwidth for 4x the data.\nThe same pattern appeared on the read side. Before each QMMA, the kernel loaded FP4 operands from shared memory with LDS.U8, one byte at a time, then reconstructed 32-bit registers using chains of IMAD (multiply-accumulate to shift and combine bytes). 104 LDS.U8 instructions, each generating shared memory bank conflicts measured at L1 Wavefronts Shared Excessive = 14,336.\nThese two problems, STS.U8 and LDS.U8, are the largest remaining performance bottleneck. 
They are a direct consequence of writing FP4 values as individual bytes in C++ rather than packing them into wider words before the store.\nThe full picture After the division fix and the fmaxf change, the SASS profile looked like this:\nCategory | Instructions | % of total\nFP4 quantization (encode + scale) | ~2,800 | 66%\nData movement (LDG, STS, LDS, STG) | ~800 | 19%\nSoftmax + V accumulation | ~400 | 10%\nQMMA (Tensor Core compute) | 4 | 0.1%\nOther (control, sync, address math) | ~200 | 5%\nThe Tensor Cores executed four instructions out of 4,200. Everything else was preparation. The kernel is not compute-bound. It is quantization-bound.\n23. Where the Time Goes and Why the Gap Is Expected PyTorch SDPA with FlashAttention reaches 15 to 16 TFLOPS on the RTX 5070 Ti for the same problem size. This kernel reaches 2.4 to 3.4 TFLOPS, roughly 4 to 5 times slower.\nThe gap is not a bug. It is a design consequence.\nFlashAttention receives Q, K, and V already in FP16. The kernel\u0026rsquo;s main loop is almost entirely MMA instructions and softmax arithmetic. There is no format conversion inside the hot path.\nThis kernel receives Q, K, and V in float32. For every tile of 8,192 elements, it computes 256 block scales (one per 32 elements), finds the absolute maximum of each block, converts each value to the nearest FP4 representation through a chain of eight comparisons, packs the results into shared memory, and only then feeds them to the Tensor Core. That quantization pass runs twice per sequence tile, once for Q and once for K.\nThe quantization is doing useful work. It is not wasted computation. But it is scalar work on data that the Tensor Core will process in a single instruction. The ratio between the two is the gap.\nFor a production inference kernel, the solution is to move the quantization outside. In a decode loop, K and V live in a KV cache that is already quantized to FP4. Q is a single token that can be quantized in a separate, lightweight kernel. 
The attention kernel itself receives pre-packed uint8 inputs and spends its time on MMA and softmax. That is what SageAttention3 does, and it is the natural next step for this project.\nBut the current kernel was never designed to compete on throughput. It was designed to make every step of the FP4 fused attention pipeline visible: the MMA fragment layout that is not documented, the container format that silently reads the wrong value if you shift by one bit, the scale distribution across lanes that required 32 probing runs to reverse-engineer, the division operator that turns into 129 function calls. None of that is visible in a CUTLASS template. Writing it from scratch with inline PTX was the only way to see it.\n24. What I Would Do Differently Looking back at several months of work, a few things stand out.\nTest with structured inputs first. The weeks I spent guessing the fragment layout produced nothing because I was testing against random data. The identity matrix test from section 11 gave me precise, per-cell information about which lane loaded which position. Both the A and B fragment fixes took less than an hour once the right test existed. Every new MMA instruction variant should be validated with identity matrices before anything else.\nRead the SASS earlier. The division slowpath was invisible at the C++ level. The scale computation looked like a single line of code. It took NCU and the SASS source view to reveal that one line was generating 129 function calls. Profiling should not be the last step. It should happen after every major code change.\nDo not optimize the wrong design. The on-the-fly quantization was never going to be fast. I knew this conceptually from the start, but I kept optimizing around it (vectorized loads, FP16 staging, shared memory reuse) instead of changing the fundamental approach. 
The optimization that would have mattered most, pre-quantized inputs, was the one I deferred the longest.\nThe fragment layout is the real contribution. The MMA m16n8k32 fragment layout for FP4 E2M1 on SM120 is not documented anywhere in the PTX ISA. The scale distribution across lanes is not documented. The container format (nibble in bits 5-2, not bits 3-0) is mentioned in one sentence in the spec but never shown in a worked example. Figuring this out empirically and publishing it is the part of this project that will be useful to other people writing SM120 kernels. The kernel performance is secondary.\nThe ecosystem is catching up. When I started this project, SM120 support in the open-source stack was minimal. SageAttention3 was 5090-only in practice, FlashInfer had no SM120 path, and vLLM fell back to Marlin for FP4 on consumer Blackwell. By the time I am writing this, all three have added or are adding SM120 support. The gap I set out to fill is closing, which is a good thing. The documentation gap remains open.\nCode: github.com/florianmattana/fp4-fused-attention-sm120\n","permalink":"https://florianmattana.com/posts/fp4-fused-attention-kernel-sm120/","summary":"\u003ch2 id=\"1-why-fp4-fused-attention-on-consumer-blackwell\"\u003e1. Why FP4 Fused Attention on Consumer Blackwell?\u003c/h2\u003e\n\u003cp\u003eThe attention mechanism in transformers scales quadratically with sequence length. On a consumer GPU with 12 GB of VRAM and 672 GB/s of memory bandwidth, that becomes a hard wall very quickly. 
The interesting thing about the RTX 5070 Ti (SM120, 46 SMs) is the raw throughput the Tensor Cores can deliver:\u003c/p\u003e\n\u003ctable\u003e\n  \u003cthead\u003e\n      \u003ctr\u003e\n          \u003cth\u003ePrecision\u003c/th\u003e\n          \u003cth\u003eThroughput\u003c/th\u003e\n      \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n      \u003ctr\u003e\n          \u003ctd\u003eFP16\u003c/td\u003e\n          \u003ctd\u003e123.5 TFLOPS\u003c/td\u003e\n      \u003c/tr\u003e\n      \u003ctr\u003e\n          \u003ctd\u003eINT8\u003c/td\u003e\n          \u003ctd\u003e246.9 TFLOPS\u003c/td\u003e\n      \u003c/tr\u003e\n      \u003ctr\u003e\n          \u003ctd\u003eFP4\u003c/td\u003e\n          \u003ctd\u003e~474 TFLOPS\u003c/td\u003e\n      \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003eThat is roughly a 4x advantage going from FP16 to FP4, and since FP4 values are four times smaller, you also move four times less data through memory. On paper, that is a massive win for attention. If you can actually use the FP4 Tensor Cores.\u003c/p\u003e","title":"Building an FP4 Fused Attention Kernel on Consumer Blackwell (SM120) "},{"content":"The Natural Question When you start learning CUDA, you use threadIdx.x, blockIdx.x and blockDim.x like magic variables that always contain the right value. At some point, you naturally start wondering: how are these values computed? Is there a function somewhere in the CUDA runtime that produces them? Can you see the source code behind them?\nThe answer is surprising: there is no code. These values are not the result of a software computation. They come directly from the hardware.\nRegisters, Not Functions The Wrong Intuition On a classic CPU, when you call a function, the processor executes a sequence of instructions, pushes values onto the stack, performs calculations, and returns a result. 
You might imagine that threadIdx.x works the same way, that some piece of code somewhere in the NVIDIA driver computes \u0026ldquo;you are thread number 42\u0026rdquo; and hands you that value.\nWhat Actually Happens That is not at all what happens. threadIdx, blockIdx and blockDim correspond to special registers physically wired into the GPU\u0026rsquo;s silicon.\nThink of it like a maternity ward. When a baby is born, nobody asks them to go look up their name in an administrative file. A wristband is immediately attached to their wrist with all their information: name, time of birth, identification number. That wristband is physically there from the very first instant. The baby doesn\u0026rsquo;t need to do anything to \u0026ldquo;compute\u0026rdquo; it. This is exactly how special registers work in CUDA. When a thread is \u0026ldquo;born\u0026rdquo; on the GPU, the hardware instantly attaches its coordinates (threadIdx, blockIdx, blockDim) to it. The thread just has to \u0026ldquo;look at its wristband\u0026rdquo; to know who it is.\nA special register is a small piece of memory integrated directly into the processor, whose value is set by the hardware itself, with no software intervention. When your thread reads threadIdx.x, it simply accesses the contents of such a register. There is no function call, no call stack, no intermediate computation. It is a direct, instantaneous read from the hardware.\nHow These Values End Up in the Registers To understand the mechanism, you need to know what happens when a kernel is launched.\nThink of it like a school principal on the first day of school. The principal (the hardware scheduler) has the complete list of students and the class assignments in front of them. 
Before classes begin, they send each student to their classroom, and at each desk they have already placed a label: \u0026ldquo;You are student number 3 in class B, and your class is in building 2.\u0026rdquo; When the student sits down, they don\u0026rsquo;t need to do anything. The information is already there, on their desk.\nHere is what happens step by step.\nStep 1: the kernel launch. When you write a kernel call with a grid and block configuration, the CUDA runtime transmits this configuration to the GPU. This is the moment when the \u0026ldquo;principal\u0026rdquo; receives their list of classes and students.\nStep 2: distribution across SMs. The hardware scheduler distributes blocks of threads across the different Streaming Multiprocessors (SMs), the GPU\u0026rsquo;s compute units. This is the moment when the principal sends each class to its room.\nStep 3: writing the registers. When a block is assigned to an SM and its threads begin executing, the scheduler physically writes into each thread\u0026rsquo;s special registers the values that correspond to it: its threadIdx, its block\u0026rsquo;s blockIdx, and the blockDim dimensions. This is the moment when each student finds their label on their desk.\nStep 4: the kernel executes. When your code starts running, the values are already there, ready to be read. The thread never needs to \u0026ldquo;discover\u0026rdquo; who it is through a computation. The hardware told it at the moment of its birth.\nTo make this concrete, suppose you launch a kernel with a grid of 4 blocks, each block containing 256 threads. At launch time, the scheduler assigns block 0 to SM number 5, block 1 to SM number 12, and so on. For each thread in block 0, it writes blockIdx.x = 0 in the corresponding register. For thread number 147 of that block, it writes threadIdx.x = 147. 
All of this happens in hardware, before a single instruction of your kernel executes.\nWhat It Looks Like at the Assembly Level In PTX, the built-in variables we know from CUDA C++ are exposed as special registers with predefined names.\nThe naming might look cryptic at first, but there is a logic to it. %tid simply stands for Thread ID, which maps directly to threadIdx. %ctaid stands for Cooperative Thread Array ID. Internally, NVIDIA doesn\u0026rsquo;t call a block a \u0026ldquo;block.\u0026rdquo; The real technical name is CTA, or Cooperative Thread Array. So %ctaid literally means \u0026ldquo;the ID of the CTA,\u0026rdquo; in other words, which block you are in within the grid. The word \u0026ldquo;block\u0026rdquo; that we use in CUDA C++ is actually a pedagogical simplification. Similarly, %ntid stands for Number of Threads in the CTA (i.e., blockDim).\nWhen the NVCC compiler transforms your CUDA code into PTX, an expression like threadIdx.x simply becomes an instruction to read the %tid.x register. In terms of cost, it is comparable to reading the rax register on an x86 processor: an operation that takes essentially zero extra cycles. There is no memory latency, no cache access, no computation.\nTo give you an order of magnitude: an access to GPU global memory costs between 200 and 800 cycles. An access to shared memory costs around 20 to 30 cycles. An access to a register costs 0 to 1 cycles. Reading threadIdx.x is therefore between 200 and 800 times cheaper than fetching a value from global memory.\nSeeing It for Real You can verify all of this yourself on godbolt.org. Take the simplest possible kernel. The key thing to notice is the mov instructions reading %ctaid.x, %tid.x, %ctaid.y, and %tid.y. No function call. No memory load. Just a direct read from a hardware register. Then the mad (Multiply-Add) instructions do the local-to-global mapping in a single operation each. 
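If you do not want to open Compiler Explorer right now, here is the shape of it: a minimal 2D index kernel, with the PTX that nvcc emits for the index computation shown as comments (a sketch; the exact register numbers vary by compiler version and flags):

```cuda
// Minimal kernel: each thread computes its global (x, y) coordinate.
__global__ void whoami(int* out, int width) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    out[y * width + x] = x;
}

// PTX for the index computation (abridged):
//   mov.u32 %r1, %ctaid.x;          // read blockIdx.x  -- special register
//   mov.u32 %r2, %ntid.x;           // read blockDim.x  -- special register
//   mov.u32 %r3, %tid.x;            // read threadIdx.x -- special register
//   mad.lo.s32 %r4, %r1, %r2, %r3;  // x = blockIdx.x * blockDim.x + threadIdx.x
//   (the same mov/mad pattern with %ctaid.y, %ntid.y, %tid.y produces y)
```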
That\u0026rsquo;s it.\nWhat This Means for the Local to Global Mapping Let\u0026rsquo;s revisit a simple computation:\nint globalX = blockIdx.x * BN + threadIdx.x; int globalY = blockIdx.y * BM + threadIdx.y; In light of what we now know, we can better understand what is actually happening.\nImagine you live in a housing complex. You know two things: your building number (blockIdx) and your apartment number within that building (threadIdx). If each building contains 50 apartments (blockDim), then your global apartment number in the entire complex is building_number * 50 + local_number. Building 3, apartment 12, is global apartment number 3 * 50 + 12 = 162. Both starting pieces of information (building and local number) were already on your wristband. The only real computation is the multiplication and the addition.\nHere\u0026rsquo;s a full worked example. Consider a kernel launched with 16x16 thread blocks on a 64x64 matrix. The thread that has blockIdx = (2, 1) and threadIdx = (5, 9) computes its global coordinates as follows: globalX = 2 * 16 + 5 = 37 (column 37) and globalY = 1 * 16 + 9 = 25 (row 25). This thread now knows it is responsible for element C[25][37] of the matrix. The reads of blockIdx and threadIdx cost 0 cycles (hardware registers). The multiplication and addition cost 1 to 2 cycles. In total, the thread determined its unique global position in fewer than 5 cycles. Compare that to the hundreds of cycles that even a single access to global memory data will cost afterwards.\nOne Step Further to Avoid Confusion A common misconception worth clearing up: when we say threadIdx and blockIdx live in hardware registers, it does not mean there are millions of physical silicon slots for millions of threads. The GPU does not physically instantiate all threads at once.\nWhat actually happens is that each SM has a fixed-size register file (for example, 65,536 registers on an A100). That is real silicon, a fixed physical resource that does not grow. 
If your kernel uses 32 registers per thread, one SM can hold at most 2,048 threads at a time. Across all 108 SMs of an A100, that gives you roughly 220,000 resident threads at any given moment, not the millions you may have launched.\nThe rest of the threads simply wait in line. When a block finishes on an SM, the scheduler assigns a new block to that same physical space. The special registers (%tid, %ctaid) are overwritten with the new thread\u0026rsquo;s values, just like a hotel room being cleaned and prepared for the next guest.\nSo do not confuse threads launched (logical, potentially millions) with threads resident (physical, limited by the register file). The hardware registers are real silicon, but they are recycled across threads over time. And this is exactly why register pressure matters so much in CUDA: the more registers your kernel uses per thread, the fewer threads the SM can host simultaneously, and the lower your occupancy drops.\nKey Takeaways CUDA\u0026rsquo;s built-in variables (threadIdx, blockIdx, blockDim) are not software. They are hardware registers, filled by the GPU\u0026rsquo;s hardware scheduler the moment each thread is launched. Reading them costs nothing. This is a deliberate architectural decision by NVIDIA to allow tens of thousands of threads to instantly know their identity, with zero overhead. In PTX assembly, they appear as %tid (Thread ID), %ctaid (Cooperative Thread Array ID, because NVIDIA internally calls a block a CTA), and %ntid (number of threads in the CTA). 
The local-to-global mapping that we build from these registers adds only a multiplication and an addition, a few cycles at most, making it one of the cheapest operations in any kernel.\nOriginally published on LinkedIn.\n","permalink":"https://florianmattana.com/posts/from-silicon-to-thread-identity/","summary":"\u003ch2 id=\"the-natural-question\"\u003eThe Natural Question\u003c/h2\u003e\n\u003cp\u003eWhen you start learning CUDA, you use \u003ccode\u003ethreadIdx.x\u003c/code\u003e, \u003ccode\u003eblockIdx.x\u003c/code\u003e and \u003ccode\u003eblockDim.x\u003c/code\u003e like magic variables that always contain the right value. At some point, you naturally start wondering: how are these values computed? Is there a function somewhere in the CUDA runtime that produces them? Can you see the source code behind them?\u003c/p\u003e\n\u003cp\u003eThe answer is surprising: there is no code. These values are not the result of a software computation. They come directly from the hardware.\u003c/p\u003e","title":"From Silicon to Thread Identity: How CUDA Threads Know Who They Are"},{"content":"Originally, I suggested on LinkedIn that I would write an article to unabstract a PyTorch program and reveal some potential CUDA optimizations we could make by debugging it. I actually took another road, but I kept in mind the desire to dig deeper from a given point of abstraction.\nFor context, I am currently writing a small program to compare some kernels with different optimization choices or intentionally unoptimized kernels to highlight the differences between various implementation options.\nBefore reaching the conclusion by uploading a terminal screenshot from which we already know the results in advance, I intended to explore the trick at a lower level.\nAs you might know, a CUDA program is compiled in two stages. First, the NVCC compiler splits your code into two parts: CPU code and GPU code. 
The CPU code is compiled in the traditional way, while the GPU code is compiled into PTX, which is a sort of intermediate representation between your source code and what the GPU actually runs. The second compilation step turns PTX into SASS, the actual machine code executed by the GPU. This can happen either during compilation (AOT) or at runtime (JIT). For now, let\u0026rsquo;s focus on PTX itself.\nSo there I was, and I thought it would be interesting to catch the difference between a naive kernel implementation and a Matrix Transpose Optimized Kernel with tiling.\nIn CUDA, a tile is a small block of data that fits into shared memory. This tiling strategy reduces slow global memory accesses by reusing data from the fast shared memory multiple times. At least, that\u0026rsquo;s what we are going to assess.\nIf you want to experiment by yourself I am dropping the generic code here and you can use godbolt.org to get the PTX result:\n// KERNEL A : Matrix Transpose Naive __global__ void matrix_transpose_naive( const float* input, float* output, int width, int height) { int col = blockDim.x * blockIdx.x + threadIdx.x; int row = blockDim.y * blockIdx.y + threadIdx.y; if (col \u0026lt; width \u0026amp;\u0026amp; row \u0026lt; height) { int input_index = row * width + col; int output_index = col * height + row; output[output_index] = input[input_index]; } } // KERNEL B : Matrix Transpose Optimized __global__ void matrix_transpose_optimized( const float* input, float* output, int width, int height) { __shared__ float tile[TILE_SIZE][TILE_SIZE + 1]; int x = blockIdx.x * TILE_SIZE + threadIdx.x; int y = blockIdx.y * TILE_SIZE + threadIdx.y; if (x \u0026lt; width \u0026amp;\u0026amp; y \u0026lt; height) { tile[threadIdx.y][threadIdx.x] = input[y * width + x]; } __syncthreads(); int x_out = blockIdx.y * TILE_SIZE + threadIdx.x; int y_out = blockIdx.x * TILE_SIZE + threadIdx.y; if (x_out \u0026lt; height \u0026amp;\u0026amp; y_out \u0026lt; width) { output[y_out * height + 
x_out] = tile[threadIdx.x][threadIdx.y]; } } And here are the PTX results we are going to work with:\n// Kernel A PTX .visible .entry matrix_transpose_naive(...) { ld.param.u64 %rd1, [_param_0]; ld.param.u64 %rd2, [_param_1]; ld.param.u32 %r3, [_param_2]; ld.param.u32 %r4, [_param_3]; mov.u32 %r5, %ntid.x; mov.u32 %r6, %ctaid.x; mov.u32 %r7, %tid.x; mad.lo.s32 %r1, %r5, %r6, %r7; mov.u32 %r8, %ctaid.y; mov.u32 %r9, %ntid.y; mov.u32 %r10, %tid.y; mad.lo.s32 %r2, %r9, %r8, %r10; setp.ge.s32 %p1, %r1, %r3; setp.ge.s32 %p2, %r2, %r4; or.pred %p3, %p1, %p2; @%p3 bra $L__BB0_2; cvta.to.global.u64 %rd3, %rd1; mad.lo.s32 %r11, %r2, %r3, %r1; mad.lo.s32 %r12, %r1, %r4, %r2; mul.wide.s32 %rd4, %r11, 4; add.s64 %rd5, %rd3, %rd4; ld.global.f32 %f1, [%rd5]; cvta.to.global.u64 %rd6, %rd2; mul.wide.s32 %rd7, %r12, 4; add.s64 %rd8, %rd6, %rd7; st.global.f32 [%rd8], %f1; $L__BB0_2: ret; } // Kernel B PTX .visible .entry matrix_transpose_optimized(...) { ld.param.u64 %rd1, [_param_0]; ld.param.u64 %rd2, [_param_1]; ld.param.u32 %r9, [_param_2]; ld.param.u32 %r10, [_param_3]; mov.u32 %r11, %ctaid.x; shl.b32 %r1, %r11, 5; mov.u32 %r2, %tid.x; add.s32 %r3, %r1, %r2; mov.u32 %r12, %ctaid.y; shl.b32 %r4, %r12, 5; mov.u32 %r5, %tid.y; add.s32 %r6, %r4, %r5; setp.ge.s32 %p1, %r3, %r9; setp.ge.s32 %p2, %r6, %r10; or.pred %p3, %p1, %p2; @%p3 bra $L__BB0_2; cvta.to.global.u64 %rd3, %rd1; mad.lo.s32 %r13, %r6, %r9, %r3; mul.wide.s32 %rd4, %r13, 4; add.s64 %rd5, %rd3, %rd4; ld.global.f32 %f1, [%rd5]; mov.u32 %r14, tile; mad.lo.s32 %r15, %r5, 132, %r14; shl.b32 %r16, %r2, 2; add.s32 %r17, %r15, %r16; st.shared.f32 [%r17], %f1; $L__BB0_2: bar.sync 0; add.s32 %r7, %r4, %r2; setp.ge.s32 %p4, %r7, %r10; add.s32 %r8, %r1, %r5; setp.ge.s32 %p5, %r8, %r9; or.pred %p6, %p5, %p4; @%p6 bra $L__BB0_4; mov.u32 %r18, tile; mad.lo.s32 %r19, %r2, 132, %r18; shl.b32 %r20, %r5, 2; add.s32 %r21, %r19, %r20; ld.shared.f32 %f2, [%r21]; mad.lo.s32 %r22, %r8, %r10, %r7; cvta.to.global.u64 %rd6, %rd2; mul.wide.s32 
%rd7, %r22, 4; add.s64 %rd8, %rd6, %rd7; st.global.f32 [%rd8], %f2; $L__BB0_4: ret; } 1. PTX, how nice you look .visible .entry (rdm_kernel_name) The common first line in both kernels is printed because of the __global__ keyword. This makes the function visible outside the module, much like a public class in traditional programming.\n2. Program is setting up Another common point between both PTX kernels is that each element of the kernel signature is stored in a dedicated register: 64-bit registers for pointers (since they store memory addresses; modern GPUs have large memory, up to tens of gigabytes, and 64-bit addresses can reference up to 16 exabytes; a 32-bit pointer could only address 4GB), and 32-bit registers (4 bytes) for integers.\nOne fun fact you may have noticed is the difference in selected registers between the two kernels for the same operation. This difference in register usage is not arbitrary. It is a direct consequence of compiler-driven register scheduling to support shared memory, tiling, and memory coalescing.\nWhen a kernel uses __shared__ memory, the compiler knows that shared memory will be leveraged to reduce slow global memory accesses and applies additional memory-related optimizations. As a result, it reorganizes register usage to hold indices, offsets, and temporary values needed for shared memory loads and stores.\n3. 
Thread indices calculation This marks the first divergence between the two kernels.\n// KERNEL A (not optimized) mov.u32 %r5, %ntid.x; mov.u32 %r6, %ctaid.x; mov.u32 %r7, %tid.x; mad.lo.s32 %r1, %r5, %r6, %r7; mov.u32 %r8, %ctaid.y; mov.u32 %r9, %ntid.y; mov.u32 %r10, %tid.y; mad.lo.s32 %r2, %r9, %r8, %r10; // Kernel B (Optimized) mov.u32 %r11, %ctaid.x; shl.b32 %r1, %r11, 5; mov.u32 %r2, %tid.x; add.s32 %r3, %r1, %r2; mov.u32 %r12, %ctaid.y; shl.b32 %r4, %r12, 5; mov.u32 %r5, %tid.y; add.s32 %r6, %r4, %r5; These two code snippets calculate thread indices in different ways, revealing an important assumption about block dimensions. The difference is:\nThe first version is flexible: it works with any block dimensions, but it requires reading blockDim and using a multiply-add instruction. The second version is faster (one fewer instruction, and a shift instead of a multiply) but only works if blocks are 32x32.\nFirst kernel: move the block dimension in x (the number of threads per block) into register r5, move the block ID in the x-dimension into register r6, move the thread ID within the block into register r7, then multiply r5 by r6, add r7, and store the result in r1. This uses the actual runtime block dimension instead of a hardcoded value. It translates the formula: globalX = blockDim.x * blockIdx.x + threadIdx.x.\nSecond kernel: move the block ID in the x-dimension into register r11, shift r11 left by 5 bits and store it in r1 (which multiplies by 32), move the thread ID within the block into register r2, then add r1 and r2 and store the result in r3. The shift left by 5 bits multiplies by 2^5 = 32. This assumes the block size is exactly 32, hardcoded at compile time because we defined the tile size as 32. 
The formula becomes: globalX = blockIdx.x * 32 + threadIdx.x.\nA left shift is cheaper than a multiplication because it is implemented as simple bit rewiring with minimal hardware and latency, while multiplication requires complex arithmetic logic, deeper pipelines, and higher scheduling and energy costs, even when both appear as a single instruction.\n4. Index boundaries // Kernel A (naive) setp.ge.s32 %p1, %r1, %r3; setp.ge.s32 %p2, %r2, %r4; or.pred %p3, %p1, %p2; @%p3 bra $L__BB0_2; // Kernel B (optimized) setp.ge.s32 %p1, %r3, %r9; setp.ge.s32 %p2, %r6, %r10; or.pred %p3, %p1, %p2; @%p3 bra $L__BB0_2; The structure is identical when it comes to preventing indices from going out of bounds. We can just observe that different registers are used (r3 vs r9, r6 vs r10) in order to reduce register pressure, enable better instruction scheduling, and improve instruction-level parallelism.\n5. Calculation time 5.1 Kernel A: coalesced read, uncoalesced write If we want to identify the moment when performance diverges even further, we need to check whether the memory access is coalesced. As a quick reminder, memory access is coalesced when threads in a warp access consecutive addresses. For float data, each address should be 4 bytes apart.\n// Read part (COALESCED) cvta.to.global.u64 %rd3, %rd1; mad.lo.s32 %r11, %r2, %r3, %r1; mul.wide.s32 %rd4, %r11, 4; add.s64 %rd5, %rd3, %rd4; ld.global.f32 %f1, [%rd5]; // Write part (UNCOALESCED) mad.lo.s32 %r12, %r1, %r4, %r2; cvta.to.global.u64 %rd6, %rd2; mul.wide.s32 %rd7, %r12, 4; add.s64 %rd8, %rd6, %rd7; st.global.f32 [%rd8], %f1; You can tell whether memory access is coalesced just by looking at the PTX. Check how threadIdx.x changes the address: if each thread increments the address by the size of one element (4 bytes for a float), the access is coalesced. If each thread jumps by a much larger number, like the matrix height, the access is not coalesced.\nRead part: mad.lo.s32 %r11, %r2, %r3, %r1. 
Here %r1 is the global column index, which increases by 1 with threadIdx.x across a warp, and %r2 * %r3 computes a base offset (row * width) that is constant within the warp. The column is added directly to the offset, giving index(thread i) = constant(r2 * r3) + i. Each thread reads the next element in memory. Coalesced.\nWrite part: mad.lo.s32 %r12, %r1, %r4, %r2. Here %r1 is again the column index and %r4 = height. This calculates index(thread i) = (constant + i) * height + row. Each thread jumps by height elements instead of 1. The addresses are not consecutive, so the write is uncoalesced.\n5.2 Kernel B: the optimized write through shared memory mov.u32 %r14, tile; mad.lo.s32 %r15, %r5, 132, %r14; shl.b32 %r16, %r2, 2; add.s32 %r17, %r15, %r16; st.shared.f32 [%r17], %f1; My first surprise was to see tile, as it was never declared in PTX. I then understood that tile is a __shared__ memory variable declared in the original CUDA code, and it doesn\u0026rsquo;t appear explicitly as a standard variable in the PTX code.\nBut the most important part is the 132 in the second instruction. In our code, we have __shared__ float tile[32][33]. Each float is 4 bytes, the second dimension is 33 (not 32), and the stride is the distance in bytes between the start of one row and the next:\nstride = 33 floats * 4 bytes/float = 132 bytes\nThis extra float (33 instead of 32) is there to avoid shared memory bank conflicts.\n6. Thread synchronization Another important divergence is thread synchronization, which is mandatory when we use shared memory.\n$L__BB0_2: bar.sync 0; When we use a tile in shared memory, all threads in a block write their part of the data into that tile. However, threads may execute at slightly different speeds. If a thread tries to read from the tile before another thread has finished writing, it could read wrong or incomplete data.\nThat\u0026rsquo;s why we use a synchronization barrier (__syncthreads() in CUDA, bar.sync 0 in PTX). It makes all threads wait until everyone has finished writing to shared memory. 
After the barrier, it is safe for all threads to read from the tile.\n7. Wrap up Exploring a PTX file is a great way to better understand CUDA code. It is a good exercise to become a stronger engineer. That said, PTX is just a low-level translation of your higher-level code. Nothing will appear in PTX that you didn\u0026rsquo;t already write.\nOriginally published on LinkedIn.\n","permalink":"https://florianmattana.com/posts/exploring-ptx-tile-optimization/","summary":"\u003cp\u003eOriginally, I suggested on LinkedIn that I would write an article to unabstract a PyTorch program and reveal some potential CUDA optimizations we could make by debugging it. I actually took another road, but I kept in mind the desire to dig deeper from a given point of abstraction.\u003c/p\u003e\n\u003cp\u003eFor context, I am currently writing a small program to compare some kernels with different optimization choices or intentionally unoptimized kernels to highlight the differences between various implementation options.\u003c/p\u003e","title":"Exploring PTX: A Close Look at Tile Optimization in CUDA"},{"content":"I\u0026rsquo;m Florian Mattana, GPU kernel engineer based in France.\nI write CUDA kernels at the PTX level for LLM inference: fused attention, quantized GEMM, online softmax. You can read about the technical work on the blog and on GitHub.\nI got into GPU computing through a weird path. Started in finance (Sorbonne master\u0026rsquo;s), then built a crypto mining rig around 2015 and got hooked on understanding why memory bandwidth matters more than clock speed. That led to production CUDA work at Geopost, Airbus, and Melexis, and eventually to writing inference kernels full time.\nI\u0026rsquo;ve lived in five countries (South Korea, Spain, France, the UK, and Russia), which taught me how to work with anyone, adapt fast, and communicate across cultures and time zones. 
I hold PMP and Agile certifications from years of shipping under heavy production constraints, so I know how to scope work, hit deadlines, and push back when a plan doesn\u0026rsquo;t make sense.\nWhen I\u0026rsquo;m not staring at NCU reports, I\u0026rsquo;m watching RC Lens lose in creative ways, rewatching Arcane for the fourth time, or playing Hunt: Showdown. I like building things from scratch, understanding how they work at every level, and explaining what I learned along the way.\nContact If you want to discuss GPU kernel work, inference optimization, or have a project where low-level CUDA expertise would help, reach out on LinkedIn or Twitter.\nI\u0026rsquo;m open to kernel engineering roles at companies working on inference, GPU compilers, or high-performance computing. Full remote or relocation, EU citizen. If you\u0026rsquo;re hiring for that, let\u0026rsquo;s talk.\n","permalink":"https://florianmattana.com/about/","summary":"\u003cp\u003eI\u0026rsquo;m Florian Mattana, GPU kernel engineer based in France.\u003c/p\u003e\n\u003cp\u003eI write CUDA kernels at the PTX level for LLM inference: fused attention, quantized GEMM, online softmax. You can read about the technical work on the \u003ca href=\"https://florianmattana.com/posts/\"\u003eblog\u003c/a\u003e and on \u003ca href=\"https://github.com/florianmattana\"\u003eGitHub\u003c/a\u003e.\u003c/p\u003e\n\u003cp\u003eI got into GPU computing through a weird path. Started in finance (Sorbonne master\u0026rsquo;s), then built a crypto mining rig around 2015 and got hooked on understanding why memory bandwidth matters more than clock speed. That led to production CUDA work at Geopost, Airbus, and Melexis, and eventually to writing inference kernels full time.\u003c/p\u003e","title":""}]